foul-ball-analysis.knit

Introduction

# Setup
library(baseballr)
library(DBI)
library(bigrquery)
library(dplyr)
library(scales)
library(ggplot2)  # loaded explicitly; the plots below call ggplot() directly
library(plotly)
library(knitr)
library(kableExtra)
library(ggimage)
library(jsonlite)

# BigQuery
bq <- dbConnect(bigrquery::bigquery(),
                project = "pjb-sports-data",
                dataset = "mlb")

# Load Statcast Data
statcast_leaderboard <- list()
for (player_type in c("batter", "pitcher")) {
  statcast_leaderboard[[player_type]] <- statcast_leaderboards(
    leaderboard = "expected_statistics", year = 2023, min_pa = 1,
    player_type = player_type
  )
}

We often talk about player tendencies toward certain outcomes: a hitter’s spray chart, a hitter’s launch angle breakdown, a pitcher’s pitch distribution, a pitcher’s fly ball to ground ball ratio, etc. However, there is a common event that often gets overlooked in baseball analysis: the foul ball. Every now and then, a stat about foul balls will emerge, such as how Joey Votto pulled one foul ball into the seats over the first 2,138 plate appearances of his career, but more attention is paid to the pitches that are put in play, taken or whiffed on.

'
SELECT
  description,
  count / total `%`
FROM
  (
    SELECT
      1 foo,
      description,
      COUNT(*) count
    FROM
      (
        SELECT
          REGEXP_REPLACE(description, "_*blocked_*", "") description,
        FROM
          `mlb.statcast_pitches`
        WHERE
          game_year = 2023 AND game_type = "R"
      )
    GROUP BY
      description
  )
JOIN
  (
    SELECT
      1 foo,
      COUNT(*) total
    FROM
      `mlb.statcast_pitches`
    WHERE
      game_year = 2023 AND game_type = "R"
  )
USING
  (foo);
' %>%
  dbGetQuery(bq, .) %>%
  ggplot(aes(reorder(description, -`%`), `%`)) +
  geom_bar(stat = "identity", fill = "#0099f9") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "2023 Pitch Results", x = "", y = "Frequency") +
  scale_y_continuous(labels = percent, limits = c(0, 0.4)) +
  geom_text(aes(label = percent(`%`, accuracy = 0.1)), vjust = -0.5, size = 3)

Foul balls make up nearly 18% of pitch outcomes and can be interpreted in a number of different ways. With fewer than two strikes…

the pitch wasn’t really what the batter was looking for, but he swung anyway.
the batter’s swing or timing was slightly off.

With two strikes…

the batter is just trying to stay alive.
the pitch wasn’t located well enough or wasn’t deceptive enough to miss the bat entirely.

'
WITH swing_results_by_year AS
  (
    SELECT
      game_year,
      CASE description WHEN "foul" THEN description
                       WHEN "hit_into_play" THEN description
                       ELSE "swinging_strike" END swing_result,
      COUNT(*) pitch_count
    FROM
      `mlb.statcast_pitches`
    WHERE
      game_type = "R" AND description IN ("foul", "hit_into_play", "swinging_strike",
                                          "foul_tip", "swinging_strike_blocked")
    GROUP BY
      game_year,
      swing_result
  )

SELECT
  game_year,
  swing_result,
  pitch_count / total_pitches `%`
FROM
  swing_results_by_year
JOIN
  (
    SELECT
      game_year,
      SUM(pitch_count) total_pitches
    FROM
      swing_results_by_year
    GROUP BY
      game_year
  )
USING
  (game_year);
' %>%
  dbGetQuery(bq, .) %>%
  ggplot(aes(x = game_year, y = `%`, label = `%`,
             fill = factor(swing_result, levels = c("swinging_strike", "hit_into_play",
                                                    "foul")))) +
  geom_bar(stat = "identity") +
  scale_x_continuous(breaks = seq(2016, 2023, 1)) +
  scale_y_continuous(labels = percent) +
  geom_text(aes(label = percent(`%`, accuracy = 0.1)),
            position = position_stack(vjust = 0.5), size = 3) +
  labs(title = "Swing Results: 2016-2023", x = "", y = "", fill = "result")

The last eight seasons have seen some variety in terms of the percentage of swings that have resulted in swinging strikes and balls hit into play, but the foul ball rate has stayed fairly steady, with the exception of 2020, which had a much smaller sample size.
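The percentages in the chart above come from collapsing the five swing descriptions into three results and dividing each count by the year's total swings. A minimal base R sketch of that same arithmetic follows; the rows are made up for illustration, and only the column names and description values are taken from the query.

```r
# Hypothetical pitch-level sample; the rows are invented for illustration.
pitches <- data.frame(
  game_year   = c(2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023),
  description = c("foul", "hit_into_play", "swinging_strike", "foul_tip",
                  "foul", "foul", "hit_into_play", "swinging_strike_blocked"),
  stringsAsFactors = FALSE
)

# Collapse the swing descriptions into three results, like the CASE expression:
# fouls and balls in play keep their label, everything else is a whiff.
pitches$swing_result <- ifelse(
  pitches$description %in% c("foul", "hit_into_play"),
  pitches$description,
  "swinging_strike"
)

# Count swings per year and result, then divide each row by its yearly total.
counts <- table(pitches$game_year, pitches$swing_result)
shares <- prop.table(counts, margin = 1)
print(round(shares, 3))
```

Each row of `shares` sums to 1, which is why the stacked bars always reach 100%.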

'
WITH swing_results_by_count AS
  (
    SELECT
      balls,
      strikes,
      CASE description WHEN "foul" THEN description
                       WHEN "hit_into_play" THEN description
                       ELSE "swinging_strike" END swing_result,
      COUNT(*) pitch_count
    FROM
      `mlb.statcast_pitches`
    WHERE
      game_year = 2023 AND game_type = "R" AND balls < 4 AND
      description IN ("foul", "hit_into_play", "swinging_strike",
                      "foul_tip", "swinging_strike_blocked")
    GROUP BY
      balls,
      strikes,
      swing_result
  )

SELECT
  balls,
  strikes,
  CONCAT(CAST(balls as STRING), "-", CAST(strikes as STRING)) count,
  swing_result,
  pitch_count / total_pitches `%`
FROM
  swing_results_by_count
JOIN
  (
    SELECT
      balls,
      strikes,
      SUM(pitch_count) total_pitches
    FROM
      swing_results_by_count
    GROUP BY
      balls,
      strikes
  )
USING
  (balls, strikes);
' %>%
  dbGetQuery(bq, .) %>%
  ggplot(aes(x = count, y = `%`, label = `%`,
             fill = factor(swing_result, levels = c("swinging_strike", "hit_into_play",
                                                    "foul")))) +
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = percent) +
  geom_text(aes(label = percent(`%`, accuracy = 0.1)),
            position = position_stack(vjust = 0.5), size = 3) +
  labs(title = "2023 Swing Results by Count", x = "Count", y = "", fill = "result")

As batters get deeper into counts, they seem to become less likely to whiff and more likely to put the ball in play or foul it off. This makes sense considering that pitchers can have an element of surprise against hitters who have not seen many of their pitches. Hitters are known to learn from pitch to pitch, home in on what a pitcher has to offer and adjust to the pitcher’s plan of attack.

Fouls can feel like a neutral outcome when they happen on the field, but they are almost always a positive outcome for either the pitcher or the hitter. For example, with fewer than two strikes, a foul ball is a positive outcome for the pitcher. Certain foul balls, like a towering drive that hooks foul, can indicate that the batter has the upper hand, but it still counts like any other strike, and the pitcher can breathe a sigh of relief. When there are two strikes, a foul ball is always going to feel like the batter held strong, and the pitcher is annoyed that he has to throw another pitch. In short, a foul ball is a win for the pitcher with fewer than 2 strikes, and it is a win for the hitter with 2 strikes. With this in mind, I wanted to split hitter and pitcher foul ball tendencies based on the count.
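To make the split concrete, here is a small base R sketch of the two rates: the share of a batter's swings that end in a foul with 0 or 1 strikes, and the same share with 2 strikes. The batter IDs and rows are hypothetical, and `foul_pct` is a helper name introduced only for this illustration.

```r
# Hypothetical swings for two made-up batter IDs.
swings <- data.frame(
  batter      = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  strikes     = c(0, 1, 1, 2, 2, 0, 0, 1, 2, 2),
  description = c("foul", "hit_into_play", "foul", "swinging_strike", "foul",
                  "swinging_strike", "foul", "hit_into_play", "foul", "foul"),
  stringsAsFactors = FALSE
)

# Share of swings that end in a foul, per batter, within a subset of swings.
foul_pct <- function(df) tapply(df$description == "foul", df$batter, mean)

early <- foul_pct(swings[swings$strikes < 2, ])   # 0 or 1 strikes
late  <- foul_pct(swings[swings$strikes == 2, ])  # 2 strikes

# One row per batter with both rates side by side.
split_df <- data.frame(
  batter              = as.integer(names(early)),
  foul_pct_0_1_strike = as.numeric(early),
  foul_pct_2_strike   = as.numeric(late[names(early)])
)
print(split_df)
```

The denominator is swings, not all pitches seen, so a batter who takes a lot of pitches is not penalized in either rate.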

batters.2023.df <- statcast_leaderboard$batter %>%
  filter(pa >= 200) %>%
  mutate(Name = sub("(.+),\\s(.+)","\\2 \\1", `last_name, first_name`)) %>%
  select(-year, -`last_name, first_name`, -est_ba_minus_ba_diff,
         -est_slg_minus_slg_diff, -est_woba_minus_woba_diff) %>%
  rename(PA = pa) %>%
  # Swings with 0 or 1 strikes
  inner_join(
    '
    SELECT
      batter player_id,
      COUNTIF(description IN ("swinging_strike", "foul_tip", "swinging_strike_blocked"))
                / COUNT(*) `0|1-Strike Whiff %`,
      COUNTIF(description = "foul") / COUNT(*) `0|1-Strike Foul %`
    FROM
      `mlb.statcast_pitches`
    WHERE
      game_year = 2023 AND game_type = "R" AND strikes < 2 AND
      description IN ("foul", "hit_into_play", "swinging_strike", "foul_tip",
                      "swinging_strike_blocked") /* Swing */
    GROUP BY
      batter;
    ' %>%
      dbGetQuery(bq, .),
    by = "player_id"
  ) %>%
  # Swings with 2 strikes
  inner_join(
    '
    SELECT
      batter player_id,
      COUNTIF(description IN ("swinging_strike", "foul_tip", "swinging_strike_blocked"))
        / COUNT(*) `2-Strike Whiff %`,
      COUNTIF(description = "foul") / COUNT(*) `2-Strike Foul %`
    FROM
      `mlb.statcast_pitches`
    WHERE
      game_year = 2023 AND game_type = "R" AND strikes = 2 AND
      description IN ("foul", "hit_into_play", "swinging_strike", "foul_tip",
                      "swinging_strike_blocked") /* Swing */
    GROUP BY
      batter;
    ' %>%
    dbGetQuery(bq, .),
    by = "player_id"
  ) %>%
  rename(`0/1-Strike Whiff %` = `0|1-Strike Whiff %`,
         `0/1-Strike Foul %` = `0|1-Strike Foul %`)

pitchers.2023.df <- statcast_leaderboard$pitcher %>%
  filter(pa >= 200) %>%
  mutate(Name = sub("(.+),\\s(.+)","\\2 \\1", `last_name, first_name`)) %>%
  select(-year, -`last_name, first_name`, -est_ba_minus_ba_diff,
         -est_slg_minus_slg_diff, -est_woba_minus_woba_diff) %>%
  rename(TBF = pa) %>%
  # Swings with 0 or 1 strikes
  inner_join(
    '
    SELECT
      pitcher player_id,
      COUNTIF(description IN ("swinging_strike", "foul_tip", "swinging_strike_blocked"))
        / COUNT(*) `0|1-Strike Whiff %`,
      COUNTIF(description = "foul") / COUNT(*) `0|1-Strike Foul %`
    FROM
      `mlb.statcast_pitches`
    WHERE
      game_year = 2023 AND game_type = "R" AND strikes < 2 AND
      description IN ("foul", "hit_into_play", "swinging_strike", "foul_tip",
                      "swinging_strike_blocked") /* Swing */
    GROUP BY
      pitcher;
    ' %>%
      dbGetQuery(bq, .),
    by = "player_id"
  ) %>%
  # Swings with 2 strikes
  inner_join(
    '
    SELECT
      pitcher player_id,
      COUNTIF(description IN ("swinging_strike", "foul_tip", "swinging_strike_blocked"))
        / COUNT(*) `2-Strike Whiff %`,
      COUNTIF(description = "foul") / COUNT(*) `2-Strike Foul %`
    FROM
      `mlb.statcast_pitches`
    WHERE
      game_year = 2023 AND game_type = "R" AND strikes = 2 AND
      description IN ("foul", "hit_into_play", "swinging_strike", "foul_tip",
                      "swinging_strike_blocked") /* Swing */
    GROUP BY
      pitcher;
    ' %>%
      dbGetQuery(bq, .),
    by = "player_id"
  ) %>%
  rename(`0/1-Strike Whiff %` = `0|1-Strike Whiff %`,
         `0/1-Strike Foul %` = `0|1-Strike Foul %`)

2023 Batters (min. 200 PAs)

First, I want to highlight the extremes: the guys who hit a ton of foul balls and the guys who hit very few.

0/1-Strike Foul %

2-Strike Foul %

10 Highest

batters.2023.df %>%
  select(Name, PA, `0/1-Strike Foul %`) %>%
  arrange(desc(`0/1-Strike Foul %`)) %>%
  mutate(`0/1-Strike Foul %` = percent(`0/1-Strike Foul %`, accuracy = 0.01)) %>%
  head(10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("bordered", "striped"))

Name	PA	0/1-Strike Foul %
Jake Cronenworth	522	48.78%
Isaac Paredes	571	48.43%
Bo Naylor	230	46.54%
Ozzie Albies	660	45.89%
Nathaniel Lowe	724	45.59%
Gio Urshela	228	45.05%
Zach McKinstry	518	45.03%
Max Kepler	491	44.52%
Adley Rutschman	687	44.26%
Pavin Smith	228	44.20%

batters.2023.df %>%
  select(Name, PA, `2-Strike Foul %`) %>%
  arrange(desc(`2-Strike Foul %`)) %>%
  mutate(`2-Strike Foul %` = percent(`2-Strike Foul %`, accuracy = 0.01)) %>%
  head(10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("bordered", "striped"))

Name	PA	2-Strike Foul %
Yordan Alvarez	496	46.95%
Sal Frelick	223	46.75%
Geraldo Perdomo	495	46.02%
Justin Turner	626	45.81%
Ty France	665	45.59%
Isiah Kiner-Falefa	361	45.55%
Cody Bellinger	556	45.43%
Daulton Varsho	581	45.11%
Will Smith	554	44.87%
Alec Burleson	347	44.87%

10 Lowest

batters.2023.df %>%
  select(Name, PA, `0/1-Strike Foul %`) %>%
  arrange(`0/1-Strike Foul %`) %>%
  mutate(`0/1-Strike Foul %` = percent(`0/1-Strike Foul %`, accuracy = 0.01)) %>%
  head(10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("bordered", "striped"))

Name	PA	0/1-Strike Foul %
Christian Bethancourt	332	29.48%
William Contreras	611	29.57%
Eloy Jiménez	489	29.92%
Javier Báez	547	30.26%
Kevin Kiermaier	408	30.29%
Aaron Judge	458	30.75%
Jose Siri	364	30.82%
Joey Wiemer	410	30.87%
Jordan Walker	465	30.91%
Luke Raley	406	31.07%

batters.2023.df %>%
  select(Name, PA, `2-Strike Foul %`) %>%
  arrange(`2-Strike Foul %`) %>%
  mutate(`2-Strike Foul %` = percent(`2-Strike Foul %`, accuracy = 0.01)) %>%
  head(10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("bordered", "striped"))

Name	PA	2-Strike Foul %
Jazz Chisholm Jr.	383	23.53%
Mark Vientos	233	25.32%
Jose Siri	364	27.57%
Eloy Jiménez	489	27.94%
Harrison Bader	344	28.64%
Brett Baty	389	28.69%
Francisco Alvarez	423	28.87%
Christopher Morel	429	29.08%
Mickey Moniak	323	29.51%
Paul DeJong	400	29.64%

# https://plotly.com/ggplot2/configuration-options/
# Highlight POIs: https://thiyanga.netlify.app/post/scatterplot/
batter.scatter.plot.df <- batters.2023.df %>%
  mutate(diff = `2-Strike Foul %` - `0/1-Strike Foul %`,
         color = as.factor(
           ifelse(diff >= 0.06, "#025189",
                  ifelse(diff >= 0.03, "#0c9cb4",
                         ifelse(diff >= 0, "#94c280",
                                ifelse(diff >= -0.03, "#f1c359",
                                       ifelse(diff >= -0.06, "#d03f2e",
                                              "#982123"))))))) %>%
  select(Name, `2-Strike Foul %`, `0/1-Strike Foul %`, color)

(batter.scatter.plot.df %>%
  ggplot(
    aes(x = `0/1-Strike Foul %`, y = `2-Strike Foul %`,
        text = paste(Name, "\n0/1-Strike Foul %: ",
                     percent(`0/1-Strike Foul %`, accuracy = 0.1),
                     "\n2-Strike Foul %: ",
                     percent(`2-Strike Foul %`, accuracy = 0.1), sep = ""))) +
    geom_point(aes(color = color)) +
    scale_colour_manual(values = levels(batter.scatter.plot.df$color)) +
    labs(title = "2023 Batters (min. 200 PAs)") +
    scale_x_continuous(labels = percent) +
    scale_y_continuous(labels = percent) +
    geom_abline(linetype = "dotted") +
    theme(legend.position = "none")) %>%
  ggplotly(tooltip = "text") %>%
  layout(annotations = list(
    list(text = "Hover over any point for player details", x = 0.437, y = 0.462,
         font = list(size = 10)),
    list(text = "y = x", x = 0.484, y = 0.476, font = list(size = 10),
         showarrow = FALSE))) %>%
  config(displayModeBar = FALSE)

There is a positive correlation between 0/1-Strike Foul % and 2-Strike Foul % (correlation coefficient of 0.42), but there is substantial variance, too. Take Yordan Alvarez, for example. With fewer than 2 strikes, only 34.1% of his swings result in foul balls, but that number increases to 46.9% with 2 strikes, which is the highest among hitters with at least 200 PAs.
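For readers who want to reproduce this kind of comparison, the sketch below computes a Pearson correlation with `cor()` and the per-player gap between the two rates. The six pairs of values are invented for illustration and are not real 2023 figures.

```r
# Hypothetical foul rates for six made-up hitters (not real 2023 values).
foul_0_1 <- c(0.31, 0.34, 0.36, 0.39, 0.42, 0.45)
foul_2   <- c(0.33, 0.47, 0.35, 0.38, 0.36, 0.44)

# Pearson correlation between the two rates across hitters.
r <- cor(foul_0_1, foul_2)

# Gap between the rates; a positive gap puts a hitter above the y = x line.
gap <- foul_2 - foul_0_1

print(round(r, 2))
print(round(gap, 3))
```

A moderate correlation can coexist with large individual gaps, which is the pattern the scatter plot shows.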

(
  'WITH
     team_foul_pct
   AS
     (
       SELECT
         CASE inning_topbot WHEN "top" THEN away_team ELSE home_team END team,
         strikes,
         COUNTIF(description = "foul") fouls,
         COUNT(*) pitches,
       FROM
         `mlb.statcast_pitches`
       WHERE
         game_year = 2023 AND game_type = "R" AND
         description IN ("foul", "hit_into_play", "swinging_strike",
                         "foul_tip", "swinging_strike_blocked") /* Swing */
       GROUP BY
         team,
         strikes
     )
  
   SELECT
     team,
     `0|1-Strike Foul %`,
     `2-Strike Foul %`
   FROM
     (
       SELECT
         team,
         SUM(fouls) / SUM(pitches) `0|1-Strike Foul %`
       FROM
         team_foul_pct
       WHERE
         strikes < 2
       GROUP BY
         team
     )
   JOIN
     (
       SELECT
         team,
         SUM(fouls) / SUM(pitches) `2-Strike Foul %`
       FROM
         team_foul_pct
       WHERE
         strikes = 2
       GROUP BY
         team
     )
   USING
     (team);' %>%
    dbGetQuery(bq, .) %>%
    rename(`0/1-Strike Foul %` = `0|1-Strike Foul %`) %>%
    left_join(fromJSON("https://statsapi.mlb.com/api/v1/teams?lang=en&sportId=1&season=2023")$teams %>%
                mutate(logo = paste("https://www.mlbstatic.com/team-logos/", id, ".svg", sep = "")) %>%
                select(abbreviation, logo),
              by = c("team" = "abbreviation")) %>%
    ggplot(aes(x = `0/1-Strike Foul %`, y = `2-Strike Foul %`,
               text = paste(team, "\n0/1-Strike Foul %: ",
                            percent(`0/1-Strike Foul %`, accuracy = 0.1),
                            "\n2-Strike Foul %: ",
                            percent(`2-Strike Foul %`, accuracy = 0.1), sep = ""))) +
    geom_image(aes(image = logo), size = 0.075, by = "height") +
    labs(title = "2023 Teams (Batting)") +
    scale_x_continuous(labels = percent) +
    scale_y_continuous(labels = percent) +
    geom_abline(linetype = "dotted") +
    theme(legend.position = "none")) #%>%

# ggplotly(tooltip = "text") %>%
# layout(annotations = list(
#   list(text = "Hover over any point for team details", x = 0.377, y = 0.3965,
#        font = list(size = 10)),
#   list(text = "y = x", x = 0.388, y = 0.39, font = list(size = 10),
#        showarrow = FALSE))) %>%
# config(displayModeBar = FALSE)

It is surprising to see that the top 2 run-scoring offenses in 2023 (Braves and Dodgers) are both below the dotted y = x line here, meaning a smaller share of their swings result in fouls with 2 strikes than with 0 or 1. Meanwhile, struggling offenses like the Royals and A’s both have a much higher 2-Strike Foul % than 0/1-Strike Foul %.

2023 Pitchers (min. 200 batters faced)

Just as certain hitters are much more foul-prone than others, certain pitchers give up far more foul balls than others.

0/1-Strike Foul %

2-Strike Foul %

10 Highest

pitchers.2023.df %>%
  select(Name, TBF, `0/1-Strike Foul %`) %>%
  arrange(desc(`0/1-Strike Foul %`)) %>%
  mutate(`0/1-Strike Foul %` = percent(`0/1-Strike Foul %`, accuracy = 0.01)) %>%
  head(10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("bordered", "striped"))

Name	TBF	0/1-Strike Foul %
Brusdar Graterol	257	45.57%
Chris Murphy	212	45.31%
Luis Severino	417	44.93%
Johnny Cueto	218	44.36%
Nestor Cortes	266	44.34%
Steven Wilson	219	44.21%
Joe Ryan	672	44.04%
Brad Hand	236	44.00%
Louie Varland	283	43.34%
Cody Bradford	234	43.08%

pitchers.2023.df %>%
  select(Name, TBF, `2-Strike Foul %`) %>%
  arrange(desc(`2-Strike Foul %`)) %>%
  mutate(`2-Strike Foul %` = percent(`2-Strike Foul %`, accuracy = 0.01)) %>%
  head(10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("bordered", "striped"))

Name	TBF	2-Strike Foul %
Ron Marinaccio	205	51.41%
Josh Hader	231	47.62%
Sean Manaea	499	46.85%
Drew Smith	244	46.73%
Reynaldo López	278	46.33%
Brock Burke	250	46.07%
Johnny Cueto	218	45.71%
Kyle Muller	372	45.66%
Jhony Brito	372	45.59%
Jake Irvin	530	45.57%

10 Lowest

pitchers.2023.df %>%
  select(Name, TBF, `0/1-Strike Foul %`) %>%
  arrange(`0/1-Strike Foul %`) %>%
  mutate(`0/1-Strike Foul %` = percent(`0/1-Strike Foul %`, accuracy = 0.01)) %>%
  head(10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("bordered", "striped"))

Name	TBF	0/1-Strike Foul %
Robert Stephenson	201	25.29%
Elvis Peguero	252	25.82%
Andrés Muñoz	211	27.80%
Giovanny Gallegos	229	28.28%
Alex Lange	288	28.71%
Josh Sborz	215	28.92%
Gregory Santos	289	28.96%
Bryan Abreu	287	29.15%
Josh Fleming	221	29.39%
Alex Young	236	29.79%

pitchers.2023.df %>%
  select(Name, TBF, `2-Strike Foul %`) %>%
  arrange(`2-Strike Foul %`) %>%
  mutate(`2-Strike Foul %` = percent(`2-Strike Foul %`, accuracy = 0.01)) %>%
  head(10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("bordered", "striped"))

Name	TBF	2-Strike Foul %
Alex Lange	288	25.00%
Yency Almonte	207	25.86%
Phil Maton	274	26.63%
Mark Leiter Jr.	269	26.83%
Alexis Díaz	286	27.66%
Quinn Priester	234	28.00%
Jordan Romano	248	28.12%
Gregory Soto	250	28.26%
Albert Abreu	268	28.66%
Drew VerHagen	268	28.82%

# https://plotly.com/ggplot2/configuration-options/
# Highlight POIs: https://thiyanga.netlify.app/post/scatterplot/
pitcher.scatter.plot.df <- pitchers.2023.df %>%
  mutate(diff = `2-Strike Foul %` - `0/1-Strike Foul %`,
         color = as.factor(
           ifelse(diff >= 0.06, "#982123",
                  ifelse(diff >= 0.03, "#d03f2e",
                         ifelse(diff >= 0, "#f1c359",
                                ifelse(diff >= -0.03, "#94c280",
                                       ifelse(diff >= -0.06, "#0c9cb4",
                                              "#025189"))))))) %>%
  select(Name, `2-Strike Foul %`, `0/1-Strike Foul %`, color)

(pitcher.scatter.plot.df %>%
  ggplot(
    aes(x = `0/1-Strike Foul %`, y = `2-Strike Foul %`,
        text = paste(Name, "\n0/1-Strike Foul %: ",
                     percent(`0/1-Strike Foul %`, accuracy = 0.1),
                     "\n2-Strike Foul %: ",
                     percent(`2-Strike Foul %`, accuracy = 0.1), sep = ""))) +
    geom_point(aes(color = color)) +
    scale_colour_manual(values = levels(pitcher.scatter.plot.df$color)) +
    labs(title = "2023 Pitchers (min. 200 batters faced)") +
    scale_x_continuous(labels = percent) +
    scale_y_continuous(labels = percent) +
    geom_abline(linetype = "dotted") +
    theme(legend.position = "none")) %>%
  ggplotly(tooltip = "text") %>%
  layout(annotations = list(
    list(text = "Hover over any point for player details", x = 0.316, y = 0.444,
         font = list(size = 10)),
    list(text = "y = x", x = 0.461, y = 0.47, font = list(size = 10),
         showarrow = FALSE))) %>%
  config(displayModeBar = FALSE)

Do foul ball tendencies relate to K%?

Foul balls never end an at bat, but they can have an effect on the eventual outcome of an at bat, either by adding a strike to the count or keeping the batter alive for another pitch. While I was not able to see much of a relationship between 0/1-Strike Foul % and wOBA or 2-Strike Foul % and wOBA, there were some noteworthy trends when comparing the Foul % stats to K%.

# Add K%
batters.2023.df <- batters.2023.df %>%
  select(-starts_with("K")) %>%
  left_join(
    '
    SELECT
      batter player_id,
      COUNT(*) K
    FROM
      `mlb.statcast_pitches`
    WHERE
      game_year = 2023 AND game_type = "R" AND
      events IN("strikeout", "strikeout_double_play")
    GROUP BY
      batter;
    ' %>%
      dbGetQuery(bq, .), by = "player_id") %>%
  mutate(`K%` = K / PA)

k.pct.lm.0.1 <- lm(`K%` ~ `0/1-Strike Foul %`, data = batters.2023.df)
k.pct.lm.2 <- lm(`K%` ~ `2-Strike Foul %`, data = batters.2023.df)

batters.2023.df %>%
  ggplot(aes(x = `0/1-Strike Foul %`, y = `K%`,
             text = paste(Name, "\n0/1-Strike Foul %: ",
                          percent(`0/1-Strike Foul %`, accuracy = 0.1), "\nK%: ",
                          percent(`K%`, accuracy = 0.1), sep = ""))) +
  geom_point(color = "#0c9cb4") +
  labs(title = "2023 Batters (min. 200 PAs)") +
  scale_x_continuous(labels = percent) +
  scale_y_continuous(labels = percent) +
  theme(legend.position = "none") +
  geom_abline(slope = k.pct.lm.0.1$coefficients[2],
              intercept = k.pct.lm.0.1$coefficients[1], color = "#BD9B60",
              linewidth = 0.5) +
  annotate("label", x = 0.45, y = 0.4, hjust = 0,
           label = paste("r =", round(cor(batters.2023.df$`K%`,
                                          batters.2023.df$`0/1-Strike Foul %`), 2)))

batters.2023.df %>%
  ggplot(aes(x = `2-Strike Foul %`, y = `K%`,
             text = paste(Name, "\n2-Strike Foul %: ",
                          percent(`2-Strike Foul %`, accuracy = 0.1), "\nK%: ",
                          percent(`K%`, accuracy = 0.1), sep = ""))) +
  geom_point(color = "#0c9cb4") +
  labs(title = "2023 Batters (min. 200 PAs)") +
  scale_x_continuous(labels = percent) +
  scale_y_continuous(labels = percent) +
  theme(legend.position = "none") +
  geom_abline(slope = k.pct.lm.2$coefficients[2],
              intercept = k.pct.lm.2$coefficients[1], color = "#BD9B60",
              linewidth = 0.5) +
  annotate("label", x = 0.45, y = 0.4, hjust = 0,
           label = paste("r =", round(cor(batters.2023.df$`K%`,
                                          batters.2023.df$`2-Strike Foul %`), 2)))

Not surprisingly, batters whose 2-strike swings often result in fouls tend to strike out less often. However, I am surprised that frequent foul hitters with fewer than two strikes also tend to strike out less. With those types of fouls, a batter is getting one strike closer to a potential strikeout, but as we learned earlier, these same guys tend to also foul off lots of two-strike pitches and stay alive.
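The trend lines and r values in the plots above come from a one-predictor `lm()` fit and `cor()`. A minimal sketch of how the two relate is below; the data frame is hypothetical, and the column names `foul_2_strike` and `k_pct` are placeholders chosen for this example.

```r
# Hypothetical hitters: foul rate on two-strike swings and strikeout rate.
hitters <- data.frame(
  foul_2_strike = c(0.28, 0.32, 0.35, 0.38, 0.41, 0.45),
  k_pct         = c(0.31, 0.27, 0.26, 0.22, 0.21, 0.17)
)

# Simple linear regression of K% on 2-strike foul rate.
fit <- lm(k_pct ~ foul_2_strike, data = hitters)
intercept <- unname(coef(fit)[1])
slope     <- unname(coef(fit)[2])

# Pearson correlation between the same two columns.
r <- cor(hitters$k_pct, hitters$foul_2_strike)

# With a single predictor, the slope is r rescaled by the two spreads.
slope_from_r <- r * sd(hitters$k_pct) / sd(hitters$foul_2_strike)

print(c(intercept = intercept, slope = slope, r = r))
```

The sign of the slope always matches the sign of r, so a downward-sloping line corresponds to a negative correlation between foul rate and K%.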

While these trends are notable for batters, we see much less correlation between foul balls and strikeouts for pitchers:

# Add K%
pitchers.2023.df <- pitchers.2023.df %>%
  select(-starts_with("K")) %>%
  left_join(
    '
    SELECT
      pitcher player_id,
      COUNT(*) K
    FROM
      `mlb.statcast_pitches`
    WHERE
      game_year = 2023 AND game_type = "R" AND
      events IN("strikeout", "strikeout_double_play")
    GROUP BY
      pitcher;
    ' %>%
      dbGetQuery(bq, .), by = "player_id") %>%
  mutate(`K%` = K / TBF)

k.pct.lm.0.1 <- lm(`K%` ~ `0/1-Strike Foul %`, data = pitchers.2023.df)
k.pct.lm.2 <- lm(`K%` ~ `2-Strike Foul %`, data = pitchers.2023.df)

pitchers.2023.df %>%
  ggplot(aes(x = `0/1-Strike Foul %`, y = `K%`,
             text = paste(Name, "\n0/1-Strike Foul %: ",
                          percent(`0/1-Strike Foul %`, accuracy = 0.1), "\nK%: ",
                          percent(`K%`, accuracy = 0.1), sep = ""))) +
  geom_point(color = "#d03f2e") +
  labs(title = "2023 Pitchers (min. 200 batters faced)") +
  scale_x_continuous(labels = percent) +
  scale_y_continuous(labels = percent) +
  theme(legend.position = "none") +
  geom_abline(slope = k.pct.lm.0.1$coefficients[2],
              intercept = k.pct.lm.0.1$coefficients[1], color = "#BD9B60",
              linewidth = 0.5) +
  annotate("label", x = 0.425, y = 0.4, hjust = 0,
           label = paste("r =", round(cor(pitchers.2023.df$`K%`,
                                          pitchers.2023.df$`0/1-Strike Foul %`), 2)))

pitchers.2023.df %>%
  ggplot(aes(x = `2-Strike Foul %`, y = `K%`,
             text = paste(Name, "\n2-Strike Foul %: ",
                          percent(`2-Strike Foul %`, accuracy = 0.1), "\nK%: ",
                          percent(`K%`, accuracy = 0.1), sep = ""))) +
  geom_point(color = "#d03f2e") +
  labs(title = "2023 Pitchers (min. 200 batters faced)") +
  scale_x_continuous(labels = percent) +
  scale_y_continuous(labels = percent) +
  theme(legend.position = "none") +
  geom_abline(slope = k.pct.lm.2$coefficients[2],
              intercept = k.pct.lm.2$coefficients[1], color = "#BD9B60",
              linewidth = 0.5) +
  annotate("label", x = 0.45, y = 0.4, hjust = 0,
           label = paste("r =", round(cor(pitchers.2023.df$`K%`,
                                          pitchers.2023.df$`2-Strike Foul %`), 2)))

Hard hit foul balls?

Since Statcast still tracks the exit velocity of foul balls, I wanted to examine not just the frequency with which players hit foul balls, but also how hard they are hitting them (Avg. EV and Hard-Hit %).

(batters.2023.df %>%
  left_join(
    '
    SELECT
      batter player_id,
      SAFE_DIVIDE(COUNTIF(launch_speed >= 95 AND description = "hit_into_play"),
                  COUNTIF(description = "hit_into_play")) `Hard-Hit %`,
      SAFE_DIVIDE(COUNTIF(launch_speed >= 95 AND description = "foul"),
                  COUNTIF(description = "foul")) `Foul Balls Hard-Hit %`
    FROM
      `mlb.statcast_pitches`
    WHERE
      game_year = 2023 AND game_type = "R" AND
      launch_speed IS NOT NULL
    GROUP BY
      batter;
    ' %>%
      dbGetQuery(bq, .) %>%
      rename(`Fair Balls: Hard-Hit %` = `Hard-Hit %`,
             `Foul Balls: Hard-Hit %` = `Foul Balls Hard-Hit %`),
    by = "player_id") %>%
  ggplot(aes(x = `Foul Balls: Hard-Hit %`, y = `Fair Balls: Hard-Hit %`,
           text = paste(Name, "\nFoul Balls Hard-Hit %: ",
                        percent(`Foul Balls: Hard-Hit %`, accuracy = 0.1),
                        "\nFair Balls Hard-Hit %: ",
                        percent(`Fair Balls: Hard-Hit %`, accuracy = 0.1),
                        sep = ""))) +
  geom_point(color = "#0c9cb4") +
  labs(title = "2023 Batters (min. 200 PAs)") +
  scale_x_continuous(labels = percent) +
  scale_y_continuous(labels = percent) +
  theme(legend.position = "none") +
  geom_abline(linetype = "dotted")) %>%
  ggplotly(tooltip = "text") %>%
  layout(annotations = list(
    list(text = "Hover over any point for player details", x = 0.171, y = 0.492,
         font = list(size = 10)))) %>%
  config(displayModeBar = FALSE)

Obviously, no players are going to have a higher hard-hit rate (95 mph off the bat or higher) on foul balls than fair balls, but some players do waste their solid contact by spraying lasers into foul territory. Isaac Paredes had 2023’s smallest disparity between hard-hit rate on fair balls and foul balls. Given that he has pulled every single one of his 53 career homers to left field, it is easy to envision him catching some balls on the barrel but too far out in front.
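As a rough illustration of the hard-hit calculation, the sketch below applies the 95 mph threshold separately to fair and foul contact for one made-up hitter. The rows are hypothetical, and `hard_hit_rate` is a helper introduced here; returning `NA` for an empty group is my stand-in for what `SAFE_DIVIDE` does in the query.

```r
# Hypothetical batted balls for one made-up hitter; launch_speed is in mph.
batted <- data.frame(
  description  = c("hit_into_play", "hit_into_play", "hit_into_play",
                   "hit_into_play", "foul", "foul", "foul", "foul", "foul"),
  launch_speed = c(101.2, 88.4, 96.0, 79.9, 95.0, 72.3, 84.1, NA, 91.5),
  stringsAsFactors = FALSE
)

# Drop untracked balls, as the launch_speed IS NOT NULL filter does.
tracked <- batted[!is.na(batted$launch_speed), ]

# Share of tracked balls at 95 mph or harder; NA when there is nothing to count.
hard_hit_rate <- function(speeds) {
  if (length(speeds) == 0) return(NA_real_)
  mean(speeds >= 95)
}

fair_rate <- hard_hit_rate(tracked$launch_speed[tracked$description == "hit_into_play"])
foul_rate <- hard_hit_rate(tracked$launch_speed[tracked$description == "foul"])

print(c(fair = fair_rate, foul = foul_rate, gap = fair_rate - foul_rate))
```

The threshold is inclusive, so a ball hit at exactly 95.0 mph counts as hard-hit.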

Main Takeaways and Future Explorations

In the end, we see successful hitters and pitchers all over the map in terms of how often and when they hit/give up foul balls. In general, hitters whose swings result in a lot of foul balls tend to strike out less, but the trends are not strong enough to declare that fouling off more pitches should be a key part of a hitter’s approach. This investigation was never meant to uncover a new key metric for hitters and pitchers. Instead, I wanted to try to learn something from a frequent pitch outcome that is infrequently discussed. The main takeaway is that there are some players of interest whose approaches are worth examining further after seeing their tendencies relating to foul balls.

Yordan Alvarez’s reputation as a bruising slugger makes it surprising that he is excellent at spoiling two-strike pitches and prolonging at bats. On the flip side, Jazz Chisholm Jr. is a young star who hits very few foul balls and shows a lot of promise at the plate. One pitcher who stood out in this study is Alexis Díaz. The young Reds closer sees a 13.4% decrease in Foul % on two-strike swings versus swings with fewer than two strikes. Given his recent success and ability to miss bats, it would be interesting to know what he does differently in potential strikeout scenarios.

dbDisconnect(bq)