Introduction
# Setup
library(baseballr)
library(DBI)
library(bigrquery)
library(dplyr)
library(scales)
library(plotly)
library(knitr)
library(kableExtra)
library(ggimage)
library(jsonlite)
# BigQuery
bq <- dbConnect(bigrquery::bigquery(),
project = "pjb-sports-data",
dataset = "mlb")
# Load Statcast Data
statcast_leaderboard <- list()
for (player_type in c("batter", "pitcher")) {
statcast_leaderboard[[player_type]] <- statcast_leaderboards(
leaderboard = "expected_statistics", year = 2023, min_pa = 1,
player_type = player_type
)
}
We talk often about player tendencies to achieve certain outcomes… a
hitter’s spray chart, a hitter’s launch angle breakdown, a pitcher’s
pitch distribution, a pitcher’s fly ball to ground ball ratio, etc.
However, there is a common event that often gets overlooked in baseball
analysis: the foul ball. Every now and then, a stat about foul balls
will emerge, such as how Joey
Votto pulled one foul ball into the seats over the first 2,138 plate
appearances of his career, but more attention is lent to the pitches
that are put in play, watched or whiffed at.
'
SELECT
description,
count / total `%`
FROM
(
SELECT
1 foo,
description,
COUNT(*) count
FROM
(
SELECT
REGEXP_REPLACE(description, "_*blocked_*", "") description,
FROM
`mlb.statcast_pitches`
WHERE
game_year = 2023 AND game_type = "R"
)
GROUP BY
description
)
JOIN
(
SELECT
1 foo,
COUNT(*) total
FROM
`mlb.statcast_pitches`
WHERE
game_year = 2023 AND game_type = "R"
)
USING
(foo);
' %>%
dbGetQuery(bq, .) %>%
ggplot(aes(reorder(description, -`%`), `%`)) +
geom_bar(stat = "identity", fill = "#0099f9") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "2023 Pitch Results", x = "", y = "Frequency") +
scale_y_continuous(labels = percent, limits = c(0, 0.4)) +
geom_text(aes(label = percent(`%`, accuracy = 0.1)), vjust = -0.5, size = 3)
Foul balls make up nearly 18% of pitch outcomes and can be
interpreted in a number of different ways. With less than two
strikes…
- the pitch wasn’t really what the batter was looking for, but he
swung anyway.
- the batter’s swing or timing was slightly off.
With two strikes…
- the batter is just trying to stay alive with two strikes.
- the pitch wasn’t located well enough or wasn’t deceptive enough to
miss the bat entirely.
'
WITH swing_results_by_year AS
(
SELECT
game_year,
CASE description WHEN "foul" THEN description
WHEN "hit_into_play" THEN description
ELSE "swinging_strike" END swing_result,
COUNT(*) pitch_count
FROM
`mlb.statcast_pitches`
WHERE
game_type = "R" AND description IN ("foul", "hit_into_play", "swinging_strike",
"foul_tip", "swinging_strike_blocked")
GROUP BY
game_year,
swing_result
)
SELECT
game_year,
swing_result,
pitch_count / total_pitches `%`
FROM
swing_results_by_year
JOIN
(
SELECT
game_year,
SUM(pitch_count) total_pitches
FROM
swing_results_by_year
GROUP BY
game_year
)
USING
(game_year);
' %>%
dbGetQuery(bq, .) %>%
ggplot(aes(x = game_year, y = `%`, label = `%`,
fill = factor(swing_result, levels = c("swinging_strike", "hit_into_play",
"foul")))) +
geom_bar(stat = "identity") +
scale_x_continuous(breaks = seq(2016, 2023, 1)) +
scale_y_continuous(labels = percent) +
geom_text(aes(label = percent(`%`, accuracy = 0.1)),
position = position_stack(vjust = 0.5), size = 3) +
labs(title = "Swing Results: 2016-2023", x = "", y = "", fill = "result")
The last eight seasons have seen some variety in terms of the
percentage of swings that have resulted in swinging strikes and balls
hit into play, but the foul ball rate has stayed fairly steady, with the
exception of 2020, which had a much smaller sample size.
'
WITH swing_results_by_count AS
(
SELECT
balls,
strikes,
CASE description WHEN "foul" THEN description
WHEN "hit_into_play" THEN description
ELSE "swinging_strike" END swing_result,
COUNT(*) pitch_count
FROM
`mlb.statcast_pitches`
WHERE
game_year = 2023 AND game_type = "R" AND balls < 4 AND
description IN ("foul", "hit_into_play", "swinging_strike",
"foul_tip", "swinging_strike_blocked")
GROUP BY
balls,
strikes,
swing_result
)
SELECT
balls,
strikes,
CONCAT(CAST(balls as STRING), "-", CAST(strikes as STRING)) count,
swing_result,
pitch_count / total_pitches `%`
FROM
swing_results_by_count
JOIN
(
SELECT
balls,
strikes,
SUM(pitch_count) total_pitches
FROM
swing_results_by_count
GROUP BY
balls,
strikes
)
USING
(balls, strikes);
' %>%
dbGetQuery(bq, .) %>%
ggplot(aes(x = count, y = `%`, label = `%`,
fill = factor(swing_result, levels = c("swinging_strike", "hit_into_play",
"foul")))) +
geom_bar(stat = "identity") +
scale_y_continuous(labels = percent) +
geom_text(aes(label = percent(`%`, accuracy = 0.1)),
position = position_stack(vjust = 0.5), size = 3) +
labs(title = "2023 Swing Results by Count", x = "Count", y = "", fill = "result")
As batters get deeper into counts, they seem to become less likely to
whiff and more likely to put the ball in play or foul it off. This makes
sense considering that pitchers can have an element of surprise against
hitters who have not seen many of their pitches. Hitters are known to
learn from pitch to pitch, hone in on what a pitcher has to offer and
adjust to the pitcher’s plan of attack.
Fouls can feel like a neutral outcome when they happen on the field,
but they are almost always a positive outcome for either the pitcher or
the hitter. For example, with less than two strikes, a foul ball is a
positive outcome for the pitcher. Certain foul balls, like a towering
drive that hooks foul, can indicate that the batter has the upper hand,
but it still counts like any other strike, and the pitcher can breath a
sigh of relief. When there are two strikes, a foul ball is always going
to feel like the batter held strong, and the pitcher is annoyed that he
has to throw another pitch. In short, a foul ball is a win for the
pitcher with less than 2 strikes, and it is a win for the hitter with 2
strikes. With this in mind, I wanted to split hitter and pitcher foul
ball tendencies based on the count.
batters.2023.df <- statcast_leaderboard$batter %>%
filter(pa >= 200) %>%
mutate(Name = sub("(.+),\\s(.+)","\\2 \\1", `last_name, first_name`)) %>%
select(-year, -`last_name, first_name`, -est_ba_minus_ba_diff,
-est_slg_minus_slg_diff, -est_woba_minus_woba_diff) %>%
rename(PA = pa) %>%
# All pitches
inner_join(
'
SELECT
batter player_id,
COUNTIF(description IN ("swinging_strike", "foul_tip", "swinging_strike_blocked"))
/ COUNT(*) `0|1-Strike Whiff %`,
COUNTIF(description = "foul") / COUNT(*) `0|1-Strike Foul %`
FROM
`mlb.statcast_pitches`
WHERE
game_year = 2023 AND game_type = "R" AND strikes < 2 AND
description IN ("foul", "hit_into_play", "swinging_strike", "foul_tip",
"swinging_strike_blocked") /* Swing */
GROUP BY
batter;
' %>%
dbGetQuery(bq, .),
by = "player_id"
) %>%
# 2-Strikes
inner_join(
'
SELECT
batter player_id,
COUNTIF(description IN ("swinging_strike", "foul_tip", "swinging_strike_blocked"))
/ COUNT(*) `2-Strike Whiff %`,
COUNTIF(description = "foul") / COUNT(*) `2-Strike Foul %`
FROM
`mlb.statcast_pitches`
WHERE
game_year = 2023 AND game_type = "R" AND strikes = 2 AND
description IN ("foul", "hit_into_play", "swinging_strike", "foul_tip",
"swinging_strike_blocked") /* Swing */
GROUP BY
batter;
' %>%
dbGetQuery(bq, .),
by = "player_id"
) %>%
rename(`0/1-Strike Whiff %` = `0|1-Strike Whiff %`,
`0/1-Strike Foul %` = `0|1-Strike Foul %`)
pitchers.2023.df <- statcast_leaderboard$pitcher %>%
filter(pa >= 200) %>%
mutate(Name = sub("(.+),\\s(.+)","\\2 \\1", `last_name, first_name`)) %>%
select(-year, -`last_name, first_name`, -est_ba_minus_ba_diff,
-est_slg_minus_slg_diff, -est_woba_minus_woba_diff) %>%
rename(TBF = pa) %>%
# All pitches
inner_join(
'
SELECT
pitcher player_id,
COUNTIF(description IN ("swinging_strike", "foul_tip", "swinging_strike_blocked"))
/ COUNT(*) `0|1-Strike Whiff %`,
COUNTIF(description = "foul") / COUNT(*) `0|1-Strike Foul %`
FROM
`mlb.statcast_pitches`
WHERE
game_year = 2023 AND game_type = "R" AND strikes < 2 AND
description IN ("foul", "hit_into_play", "swinging_strike", "foul_tip",
"swinging_strike_blocked") /* Swing */
GROUP BY
pitcher;
' %>%
dbGetQuery(bq, .),
by = "player_id"
) %>%
# 2-Strikes
inner_join(
'
SELECT
pitcher player_id,
COUNTIF(description IN ("swinging_strike", "foul_tip", "swinging_strike_blocked"))
/ COUNT(*) `2-Strike Whiff %`,
COUNTIF(description = "foul") / COUNT(*) `2-Strike Foul %`
FROM
`mlb.statcast_pitches`
WHERE
game_year = 2023 AND game_type = "R" AND strikes = 2 AND
description IN ("foul", "hit_into_play", "swinging_strike", "foul_tip",
"swinging_strike_blocked") /* Swing */
GROUP BY
pitcher;
' %>%
dbGetQuery(bq, .),
by = "player_id"
) %>%
rename(`0/1-Strike Whiff %` = `0|1-Strike Whiff %`,
`0/1-Strike Foul %` = `0|1-Strike Foul %`)
2023 Batters (min. 200 PAs)
First, I want to highlight the extremes: the guys who hit a ton of
foul balls and the guys who hit very few.
10 Highest
batters.2023.df %>%
select(Name, PA, `0/1-Strike Foul %`) %>%
arrange(desc(`0/1-Strike Foul %`)) %>%
mutate(`0/1-Strike Foul %` = percent(`0/1-Strike Foul %`, accuracy = 0.01)) %>%
head(10) %>%
kable() %>%
kable_styling(bootstrap_options = c("bordered", "striped"))
Name
|
PA
|
0/1-Strike Foul %
|
Jake Cronenworth
|
522
|
48.78%
|
Isaac Paredes
|
571
|
48.43%
|
Bo Naylor
|
230
|
46.54%
|
Ozzie Albies
|
660
|
45.89%
|
Nathaniel Lowe
|
724
|
45.59%
|
Gio Urshela
|
228
|
45.05%
|
Zach McKinstry
|
518
|
45.03%
|
Max Kepler
|
491
|
44.52%
|
Adley Rutschman
|
687
|
44.26%
|
Pavin Smith
|
228
|
44.20%
|
batters.2023.df %>%
select(Name, PA, `2-Strike Foul %`) %>%
arrange(desc(`2-Strike Foul %`)) %>%
mutate(`2-Strike Foul %` = percent(`2-Strike Foul %`, accuracy = 0.01)) %>%
head(10) %>%
kable() %>%
kable_styling(bootstrap_options = c("bordered", "striped"))
Name
|
PA
|
2-Strike Foul %
|
Yordan Alvarez
|
496
|
46.95%
|
Sal Frelick
|
223
|
46.75%
|
Geraldo Perdomo
|
495
|
46.02%
|
Justin Turner
|
626
|
45.81%
|
Ty France
|
665
|
45.59%
|
Isiah Kiner-Falefa
|
361
|
45.55%
|
Cody Bellinger
|
556
|
45.43%
|
Daulton Varsho
|
581
|
45.11%
|
Will Smith
|
554
|
44.87%
|
Alec Burleson
|
347
|
44.87%
|
10 Lowest
batters.2023.df %>%
select(Name, PA, `0/1-Strike Foul %`) %>%
arrange(`0/1-Strike Foul %`) %>%
mutate(`0/1-Strike Foul %` = percent(`0/1-Strike Foul %`, accuracy = 0.01)) %>%
head(10) %>%
kable() %>%
kable_styling(bootstrap_options = c("bordered", "striped"))
Name
|
PA
|
0/1-Strike Foul %
|
Christian Bethancourt
|
332
|
29.48%
|
William Contreras
|
611
|
29.57%
|
Eloy Jiménez
|
489
|
29.92%
|
Javier Báez
|
547
|
30.26%
|
Kevin Kiermaier
|
408
|
30.29%
|
Aaron Judge
|
458
|
30.75%
|
Jose Siri
|
364
|
30.82%
|
Joey Wiemer
|
410
|
30.87%
|
Jordan Walker
|
465
|
30.91%
|
Luke Raley
|
406
|
31.07%
|
batters.2023.df %>%
select(Name, PA, `2-Strike Foul %`) %>%
arrange(`2-Strike Foul %`) %>%
mutate(`2-Strike Foul %` = percent(`2-Strike Foul %`, accuracy = 0.01)) %>%
head(10) %>%
kable() %>%
kable_styling(bootstrap_options = c("bordered", "striped"))
Name
|
PA
|
2-Strike Foul %
|
Jazz Chisholm Jr.
|
383
|
23.53%
|
Mark Vientos
|
233
|
25.32%
|
Jose Siri
|
364
|
27.57%
|
Eloy Jiménez
|
489
|
27.94%
|
Harrison Bader
|
344
|
28.64%
|
Brett Baty
|
389
|
28.69%
|
Francisco Alvarez
|
423
|
28.87%
|
Christopher Morel
|
429
|
29.08%
|
Mickey Moniak
|
323
|
29.51%
|
Paul DeJong
|
400
|
29.64%
|
# https://plotly.com/ggplot2/configuration-options/
# Highlight POIs: https://thiyanga.netlify.app/post/scatterplot/
batter.scatter.plot.df <- batters.2023.df %>%
mutate(diff = `2-Strike Foul %` - `0/1-Strike Foul %`,
color = as.factor(
ifelse(diff >= 0.06, "#025189",
ifelse(diff >= 0.03, "#0c9cb4",
ifelse(diff >= 0, "#94c280",
ifelse(diff >= -0.03, "#f1c359",
ifelse(diff >= -0.06, "#d03f2e",
"#982123"))))))) %>%
select(Name, `2-Strike Foul %`, `0/1-Strike Foul %`, color)
(batter.scatter.plot.df %>%
ggplot(
aes(x = `0/1-Strike Foul %`, y = `2-Strike Foul %`,
text = paste(Name, "\n0/1-Strike Foul %: ",
percent(`0/1-Strike Foul %`, accuracy = 0.1),
"\n2-Strike Foul %: ",
percent(`2-Strike Foul %`, accuracy = 0.1), sep = ""))) +
geom_point(aes(color = color)) +
scale_colour_manual(values = levels(batter.scatter.plot.df$color)) +
labs(title = "2023 Batters (min. 200 PAs)") +
scale_x_continuous(labels = percent) +
scale_y_continuous(labels = percent) +
geom_abline(linetype = "dotted") +
theme(legend.position = "none")) %>%
ggplotly(tooltip = "text") %>%
layout(annotations = list(
list(text = "Hover over any point for player details", x = 0.437, y = 0.462,
font = list(size = 10)),
list(text = "y = x", x = 0.484, y = 0.476, font = list(size = 10),
showarrow = FALSE))) %>%
config(displayModeBar = FALSE)
There is a positive correlation between 0/1-Strike Foul % and
2-Strike Foul % (correlation coefficient of 0.42), but substantial
variance exists, too. Take Yordan Alvarez, for example. With less than 2
strikes, he fouls off only 34.1% of pitches, but that number increases
to 46.9% with 2 strikes, which is the highest in all of MLB.
(
'WITH
team_foul_pct
AS
(
SELECT
CASE inning_topbot WHEN "top" THEN away_team ELSE home_team END team,
strikes,
COUNTIF(description = "foul") fouls,
COUNT(*) pitches,
FROM
`mlb.statcast_pitches`
WHERE
game_year = 2023 AND game_type = "R" AND
description IN ("foul", "hit_into_play", "swinging_strike",
"foul_tip", "swinging_strike_blocked") /* Swing */
GROUP BY
team,
strikes
)
SELECT
team,
`0|1-Strike Foul %`,
`2-Strike Foul %`
FROM
(
SELECT
team,
SUM(fouls) / SUM(pitches) `0|1-Strike Foul %`
FROM
team_foul_pct
WHERE
strikes < 2
GROUP BY
team
)
JOIN
(
SELECT
team,
SUM(fouls) / SUM(pitches) `2-Strike Foul %`
FROM
team_foul_pct
WHERE
strikes = 2
GROUP BY
team
)
USING
(team);' %>%
dbGetQuery(bq, .) %>%
rename(`0/1-Strike Foul %` = `0|1-Strike Foul %`) %>%
left_join(fromJSON("https://statsapi.mlb.com/api/v1/teams?lang=en&sportId=1&season=2023")$teams %>%
mutate(logo = paste("https://www.mlbstatic.com/team-logos/", id, ".svg", sep = "")) %>%
select(abbreviation, logo),
by = c("team" = "abbreviation")) %>%
ggplot(aes(x = `0/1-Strike Foul %`, y = `2-Strike Foul %`,
text = paste(team, "\n0/1-Strike Foul %: ",
percent(`0/1-Strike Foul %`, accuracy = 0.1),
"\n2-Strike Foul %: ",
percent(`2-Strike Foul %`, accuracy = 0.1), sep = ""))) +
geom_image(aes(image = logo), size = 0.075, by = "height") +
labs(title = "2023 Teams (Batting)") +
scale_x_continuous(labels = percent) +
scale_y_continuous(labels = percent) +
geom_abline(linetype = "dotted") +
theme(legend.position = "none")) #%>%
# ggplotly(tooltip = "text") %>%
# layout(annotations = list(
# list(text = "Hover over any point for team details", x = 0.377, y = 0.3965,
# font = list(size = 10)),
# list(text = "y = x", x = 0.388, y = 0.39, font = list(size = 10),
# showarrow = FALSE))) %>%
# config(displayModeBar = FALSE)
It is surprising to see that the top 2 run-scoring offenses in 2023
(Braves and Dodgers) are both below the dotted line here. Meanwhile,
struggling offenses like the Royals and A’s both have a much higher
2-Strike Foul % than 0/1-Strike Foul %.
2023 Pitchers (min. 200 batters faced)
Similar to how certain hitters are much more foul ball prone than
others, pitchers can be the same way.
10 Highest
pitchers.2023.df %>%
select(Name, TBF, `0/1-Strike Foul %`) %>%
arrange(desc(`0/1-Strike Foul %`)) %>%
mutate(`0/1-Strike Foul %` = percent(`0/1-Strike Foul %`, accuracy = 0.01)) %>%
head(10) %>%
kable() %>%
kable_styling(bootstrap_options = c("bordered", "striped"))
Name
|
TBF
|
0/1-Strike Foul %
|
Brusdar Graterol
|
257
|
45.57%
|
Chris Murphy
|
212
|
45.31%
|
Luis Severino
|
417
|
44.93%
|
Johnny Cueto
|
218
|
44.36%
|
Nestor Cortes
|
266
|
44.34%
|
Steven Wilson
|
219
|
44.21%
|
Joe Ryan
|
672
|
44.04%
|
Brad Hand
|
236
|
44.00%
|
Louie Varland
|
283
|
43.34%
|
Cody Bradford
|
234
|
43.08%
|
pitchers.2023.df %>%
select(Name, TBF, `2-Strike Foul %`) %>%
arrange(desc(`2-Strike Foul %`)) %>%
mutate(`2-Strike Foul %` = percent(`2-Strike Foul %`, accuracy = 0.01)) %>%
head(10) %>%
kable() %>%
kable_styling(bootstrap_options = c("bordered", "striped"))
Name
|
TBF
|
2-Strike Foul %
|
Ron Marinaccio
|
205
|
51.41%
|
Josh Hader
|
231
|
47.62%
|
Sean Manaea
|
499
|
46.85%
|
Drew Smith
|
244
|
46.73%
|
Reynaldo López
|
278
|
46.33%
|
Brock Burke
|
250
|
46.07%
|
Johnny Cueto
|
218
|
45.71%
|
Kyle Muller
|
372
|
45.66%
|
Jhony Brito
|
372
|
45.59%
|
Jake Irvin
|
530
|
45.57%
|
10 Lowest
pitchers.2023.df %>%
select(Name, TBF, `0/1-Strike Foul %`) %>%
arrange(`0/1-Strike Foul %`) %>%
mutate(`0/1-Strike Foul %` = percent(`0/1-Strike Foul %`, accuracy = 0.01)) %>%
head(10) %>%
kable() %>%
kable_styling(bootstrap_options = c("bordered", "striped"))
Name
|
TBF
|
0/1-Strike Foul %
|
Robert Stephenson
|
201
|
25.29%
|
Elvis Peguero
|
252
|
25.82%
|
Andrés Muñoz
|
211
|
27.80%
|
Giovanny Gallegos
|
229
|
28.28%
|
Alex Lange
|
288
|
28.71%
|
Josh Sborz
|
215
|
28.92%
|
Gregory Santos
|
289
|
28.96%
|
Bryan Abreu
|
287
|
29.15%
|
Josh Fleming
|
221
|
29.39%
|
Alex Young
|
236
|
29.79%
|
pitchers.2023.df %>%
select(Name, TBF, `2-Strike Foul %`) %>%
arrange(`2-Strike Foul %`) %>%
mutate(`2-Strike Foul %` = percent(`2-Strike Foul %`, accuracy = 0.01)) %>%
head(10) %>%
kable() %>%
kable_styling(bootstrap_options = c("bordered", "striped"))
Name
|
TBF
|
2-Strike Foul %
|
Alex Lange
|
288
|
25.00%
|
Yency Almonte
|
207
|
25.86%
|
Phil Maton
|
274
|
26.63%
|
Mark Leiter Jr.
|
269
|
26.83%
|
Alexis Díaz
|
286
|
27.66%
|
Quinn Priester
|
234
|
28.00%
|
Jordan Romano
|
248
|
28.12%
|
Gregory Soto
|
250
|
28.26%
|
Albert Abreu
|
268
|
28.66%
|
Drew VerHagen
|
268
|
28.82%
|
# https://plotly.com/ggplot2/configuration-options/
# Highlight POIs: https://thiyanga.netlify.app/post/scatterplot/
pitcher.scatter.plot.df <- pitchers.2023.df %>%
mutate(diff = `2-Strike Foul %` - `0/1-Strike Foul %`,
color = as.factor(
ifelse(diff >= 0.06, "#982123",
ifelse(diff >= 0.03, "#d03f2e",
ifelse(diff >= 0, "#f1c359",
ifelse(diff >= -0.03, "#94c280",
ifelse(diff >= -0.06, "#0c9cb4",
"#025189"))))))) %>%
select(Name, `2-Strike Foul %`, `0/1-Strike Foul %`, color)
(pitcher.scatter.plot.df %>%
ggplot(
aes(x = `0/1-Strike Foul %`, y = `2-Strike Foul %`,
text = paste(Name, "\n0/1-Strike Foul %: ",
percent(`0/1-Strike Foul %`, accuracy = 0.1),
"\n2-Strike Foul %: ",
percent(`2-Strike Foul %`, accuracy = 0.1), sep = ""))) +
geom_point(aes(color = color)) +
scale_colour_manual(values = levels(pitcher.scatter.plot.df$color)) +
labs(title = "2023 Pitchers (min. 200 batters faced)") +
scale_x_continuous(labels = percent) +
scale_y_continuous(labels = percent) +
geom_abline(linetype = "dotted") +
theme(legend.position = "none")) %>%
ggplotly(tooltip = "text") %>%
layout(annotations = list(
list(text = "Hover over any point for player details", x = 0.316, y = 0.444,
font = list(size = 10)),
list(text = "y = x", x = 0.461, y = 0.47, font = list(size = 10),
showarrow = FALSE))) %>%
config(displayModeBar = FALSE)
Do foul ball tendencies relate to K%
Foul balls never end an at bat, but they can have an affect on the
eventual outcome of an at bat, either by adding a strike to the count or
keeping the batter alive for another pitch. While I was not able to see
much of a relationship between 0/1-Strike Foul % and wOBA or 2-Strike
Foul % and wOBA, there were some noteworthy trends when comparing the
Foul % stats to K%.
# Add K%
batters.2023.df <- batters.2023.df %>%
select(-starts_with("K")) %>%
left_join(
'
SELECT
batter player_id,
COUNT(*) K
FROM
`mlb.statcast_pitches`
WHERE
game_year = 2023 AND game_type = "R" AND
events IN("strikeout", "strikeout_double_play")
GROUP BY
batter;
' %>%
dbGetQuery(bq, .), by = "player_id") %>%
mutate(`K%` = K / PA)
k.pct.lm.0.1 <- lm(`K%` ~ `0/1-Strike Foul %`, data = batters.2023.df)
k.pct.lm.2 <- lm(`K%` ~ `2-Strike Foul %`, data = batters.2023.df)
batters.2023.df %>%
ggplot(aes(x = `0/1-Strike Foul %`, y = `K%`,
text = paste(Name, "\n0/1-Strike Foul %: ",
percent(`0/1-Strike Foul %`, accuracy = 0.1), "\nK%: ",
percent(`K%`, accuracy = 0.1), sep = ""))) +
geom_point(color = "#0c9cb4") +
labs(title = "2023 Batters (min. 200 PAs)") +
scale_x_continuous(labels = percent) +
scale_y_continuous(labels = percent) +
theme(legend.position = "none") +
geom_abline(slope = k.pct.lm.0.1$coefficients[2],
intercept = k.pct.lm.0.1$coefficients[1], color = "#BD9B60",
linewidth = 0.5) +
annotate("label", x = 0.45, y = 0.4, hjust = 0,
label = paste("r =", round(cor(batters.2023.df$`K%`,
batters.2023.df$`0/1-Strike Foul %`), 2)))
batters.2023.df %>%
ggplot(aes(x = `2-Strike Foul %`, y = `K%`,
text = paste(Name, "\n2-Strike Foul %: ",
percent(`2-Strike Foul %`, accuracy = 0.1), "\nK%: ",
percent(`K%`, accuracy = 0.1), sep = ""))) +
geom_point(color = "#0c9cb4") +
labs(title = "2023 Batters (min. 200 PAs)") +
scale_x_continuous(labels = percent) +
scale_y_continuous(labels = percent) +
theme(legend.position = "none") +
geom_abline(slope = k.pct.lm.2$coefficients[2],
intercept = k.pct.lm.2$coefficients[1], color = "#BD9B60",
linewidth = 0.5) +
annotate("label", x = 0.45, y = 0.4, hjust = 0,
label = paste("r =", round(cor(batters.2023.df$`K%`,
batters.2023.df$`2-Strike Foul %`), 2)))
Not surprisingly, there is a trend that batters whose 2-strike swings
often result in fouls tend to strike out less often. However, I am
surprised that frequent foul hitters with less than two strikes also
tend to strike out less. With those types of fouls, a batter is getting
one strike closer to a potential strikeout, but as we learned earlier,
these same guys tend to also foul off lots of two-strike pitches and
stay alive.
While these trends are notable for batters, we see much less
correlation between foul balls and strikeouts with pitchers:
# Add K%
pitchers.2023.df <- pitchers.2023.df %>%
select(-starts_with("K")) %>%
left_join(
'
SELECT
pitcher player_id,
COUNT(*) K
FROM
`mlb.statcast_pitches`
WHERE
game_year = 2023 AND game_type = "R" AND
events IN("strikeout", "strikeout_double_play")
GROUP BY
pitcher;
' %>%
dbGetQuery(bq, .), by = "player_id") %>%
mutate(`K%` = K / TBF)
k.pct.lm.0.1 <- lm(`K%` ~ `0/1-Strike Foul %`, data = pitchers.2023.df)
k.pct.lm.2 <- lm(`K%` ~ `2-Strike Foul %`, data = pitchers.2023.df)
pitchers.2023.df %>%
ggplot(aes(x = `0/1-Strike Foul %`, y = `K%`,
text = paste(Name, "\n0/1-Strike Foul %: ",
percent(`0/1-Strike Foul %`, accuracy = 0.1), "\nK%: ",
percent(`K%`, accuracy = 0.1), sep = ""))) +
geom_point(color = "#d03f2e") +
labs(title = "2023 Pitchers (min. 200 batters faced)") +
scale_x_continuous(labels = percent) +
scale_y_continuous(labels = percent) +
theme(legend.position = "none") +
geom_abline(slope = k.pct.lm.0.1$coefficients[2],
intercept = k.pct.lm.0.1$coefficients[1], color = "#BD9B60",
linewidth = 0.5) +
annotate("label", x = 0.425, y = 0.4, hjust = 0,
label = paste("r =", round(cor(pitchers.2023.df$`K%`,
pitchers.2023.df$`0/1-Strike Foul %`), 2)))
pitchers.2023.df %>%
ggplot(aes(x = `2-Strike Foul %`, y = `K%`,
text = paste(Name, "\n2-Strike Foul %: ",
percent(`2-Strike Foul %`, accuracy = 0.1), "\nK%: ",
percent(`K%`, accuracy = 0.1), sep = ""))) +
geom_point(color = "#d03f2e") +
labs(title = "2023 Pitchers (min. 200 batters faced)") +
scale_x_continuous(labels = percent) +
scale_y_continuous(labels = percent) +
theme(legend.position = "none") +
geom_abline(slope = k.pct.lm.2$coefficients[2],
intercept = k.pct.lm.2$coefficients[1], color = "#BD9B60",
linewidth = 0.5) +
annotate("label", x = 0.45, y = 0.4, hjust = 0,
label = paste("r =", round(cor(pitchers.2023.df$`K%`,
pitchers.2023.df$`2-Strike Foul %`), 2)))
Hard hit foul balls?
Since Statcast still tracks the exit velocity of foul balls, I wanted
to examine not just the frequency with which players hit foul balls, but
also how hard they are hitting them (Avg. EV and Hard Hit %)
(batters.2023.df %>%
left_join(
'
SELECT
batter player_id,
SAFE_DIVIDE(COUNTIF(launch_speed >= 95 AND description = "hit_into_play"),
COUNTIF(description = "hit_into_play")) `Hard-Hit %`,
SAFE_DIVIDE(COUNTIF(launch_speed >= 95 AND description = "foul"),
COUNTIF(description = "foul")) `Foul Balls Hard-Hit %`
FROM
`mlb.statcast_pitches`
WHERE
game_year = 2023 AND game_type = "R" AND
launch_speed IS NOT NULL
GROUP BY
batter;
' %>%
dbGetQuery(bq, .) %>%
rename(`Fair Balls: Hard-Hit %` = `Hard-Hit %`,
`Foul Balls: Hard-Hit %` = `Foul Balls Hard-Hit %`),
by = "player_id") %>%
ggplot(aes(x = `Foul Balls: Hard-Hit %`, y = `Fair Balls: Hard-Hit %`,
text = paste(Name, "\nFoul Balls Hard-Hit %: ",
percent(`Foul Balls: Hard-Hit %`, accuracy = 0.1),
"\nFair Balls Hard-Hit %: ",
percent(`Fair Balls: Hard-Hit %`, accuracy = 0.1),
sep = ""))) +
geom_point(color = "#0c9cb4") +
labs(title = "2023 Batters (min. 200 PAs)") +
scale_x_continuous(labels = percent) +
scale_y_continuous(labels = percent) +
theme(legend.position = "none") +
geom_abline(linetype = "dotted")) %>%
ggplotly(tooltip = "text") %>%
layout(annotations = list(
list(text = "Hover over any point for player details", x = 0.171, y = 0.492,
font = list(size = 10)))) %>%
config(displayModeBar = FALSE)
Obviously, no players are going to have a higher hard-hit rate (95
mph off the bat or higher) on foul balls than fair balls, but some
players do waste their solid contact by spraying lasers into foul
territory. Isaac Paredes had 2023’s smallest disparity between hard-hit
rate on fair balls and foul balls. Given that he has pulled every single
one of his 53 career homers to left field, it is easy to envision him
catching some balls on the barrel but too far out in front.
Main Takeways and Future Explorations
In the end, we see successful hitters and pitchers all over the map
in terms of how often and when they hit/give up foul balls. In general,
hitters whose swings result in a lot of foul balls tend to strike out
less, but the trends are not strong enough to declare that fouling off
more pitches should be a key part of a hitter’s approach. This
investigation was never meant to uncover a new key metric for hitters
and pitchers. Instead, I wanted to try and learn something from a
frequent pitch outcome that is infrequently discussed. The main takeaway
is that there are some players of interest whose approaches are worth
examining further after seeing their tendencies relating to foul
balls.
Yordan Alvarez reputation as a bruising slugger makes it surprising
that he is excellent at spoiling two-strike pitches and prolonging at
bats. On the flip side, Jazz Chisholm, Jr. is a young star who hits very
few foul balls and shows a lot of promise at the plate. One pitcher that
stood out in this study is Alexis Diaz. The young Reds closer sees a
13.4% decrease in Foul % on two-strike swings versus swings with less
than two strikes. Given his recent success and ability to miss bats, it
would be interesting to know what he does differently in potential
strikeout scenarios.
dbDisconnect(bq)