PJB Sports Data

Introduction

Much of the MLB research I have done in recent years has involved loading Statcast data from Baseball Savant. This publicly available data provides an incredible opportunity to conduct in-depth studies of what we see happening on the diamond. However, I found that I was left with two options when working with this data:

Query all the Statcast pitches from a given time range

Examples
- Python: pybaseball’s statcast function
- R: baseballr’s statcast_search function

Pros	Cons
Once this data is loaded, it is extremely detailed. There are many analysis options from here	Slow: getting all the pitches from 5 days of games takes almost a minute
	Risks bogging down Baseball Savant’s servers
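
As a rough illustration of this first option, here is a minimal sketch of pulling pitch-level data in small date windows with a pause between requests. The helper names (date_windows, load_pitches), the 5-day window, and the 2-second pause are my own assumptions for illustration, not part of pybaseball; only the statcast(start_dt, end_dt) call comes from the library.

```python
from datetime import date, timedelta
import time


def date_windows(start, end, days=5):
    """Yield (window_start, window_end) pairs covering start..end inclusive."""
    if days < 1:
        raise ValueError("days must be at least 1")
    current = start
    while current <= end:
        window_end = min(current + timedelta(days=days - 1), end)
        yield current, window_end
        current = window_end + timedelta(days=1)


def load_pitches(start, end, fetch=None, days=5, pause_seconds=2.0):
    """Fetch pitch-level data one window at a time; return a list of results."""
    if fetch is None:
        # Imported lazily so date_windows works without pybaseball installed.
        from pybaseball import statcast

        def fetch(window_start, window_end):
            return statcast(start_dt=window_start, end_dt=window_end)

    frames = []
    for i, (window_start, window_end) in enumerate(date_windows(start, end, days)):
        if i > 0:
            # Assumed courtesy pause between requests to Baseball Savant.
            time.sleep(pause_seconds)
        frames.append(fetch(window_start.isoformat(), window_end.isoformat()))
    return frames
```

With pybaseball installed, load_pitches(date(2023, 4, 1), date(2023, 4, 12)) would issue three requests; the results can then be combined with pandas.concat.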

Loading aggregated data

Pros	Cons
Faster	Answers only one question; new data has to be queried to address any other question

In the end, the point of this project is to knock out a lot of the data engineering work that inevitably has to happen with most analytics projects. Hopefully, I will be able to

Central Data Warehouse

With this in mind, I decided that I would build out my own sports data warehouse where I could…

Efficiently query data from tables
Map tables before loading to memory
Avoid bogging down external sources by querying daily/weekly/etc. to update my data warehouse
Build scripts and dashboards to better understand data
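
The daily/weekly update idea above can be sketched as a small helper that works out which dates are still missing from the warehouse, so each run only asks the external source for new games. The function name and the choice to skip the current day are assumptions for illustration, not a description of the project's actual scripts.

```python
from datetime import date, timedelta


def next_update_window(last_loaded, today):
    """Return the (start, end) dates still missing, or None if up to date."""
    start = last_loaded + timedelta(days=1)
    # Assumption: skip today, since games may still be in progress.
    end = today - timedelta(days=1)
    if start > end:
        return None
    return start, end
```

Here last_loaded would come from the warehouse itself (for example, the latest game date already stored), so a weekly run only requests about a week of pitches instead of re-querying the whole season.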

BigQuery

I chose Google BigQuery as a data warehouse because…

10 GB of free storage each month
1 TB of free query processing each month
SQL!
Google Cloud is a valuable skill in the data engineering world
Easy connection to Python, R, etc.
Easy setup compared to Oracle, in my personal opinion.
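
To give a feel for the Python connection, here is a minimal sketch using the google-cloud-bigquery client. The table name my-project.mlb.statcast_pitches is hypothetical, and the sketch assumes the table keeps Statcast's pitcher, pitch_type, release_speed, and game_date columns, with game_date stored as a DATE.

```python
from datetime import date

# Hypothetical table name; replace with your own project, dataset, and table.
PITCH_TABLE = "my-project.mlb.statcast_pitches"


def build_velocity_query(table=PITCH_TABLE):
    """Return SQL for average release speed by pitcher and pitch type."""
    return f"""
        SELECT pitcher,
               pitch_type,
               COUNT(*) AS pitches,
               AVG(release_speed) AS avg_release_speed
        FROM `{table}`
        WHERE game_date BETWEEN @start_date AND @end_date
        GROUP BY pitcher, pitch_type
        ORDER BY pitches DESC
    """


def run_velocity_query(start, end, table=PITCH_TABLE):
    """Run the query in BigQuery and return the rows as a pandas DataFrame."""
    # Imported lazily so build_velocity_query works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses your default Google Cloud credentials
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("start_date", "DATE", start),
            bigquery.ScalarQueryParameter("end_date", "DATE", end),
        ]
    )
    job = client.query(build_velocity_query(table), job_config=job_config)
    return job.to_dataframe()
```

Because the aggregation runs inside BigQuery, only the summarized rows are loaded into memory, e.g. run_velocity_query(date(2023, 4, 1), date(2023, 4, 30)).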

Current Data Sources

MLB
- Statcast (via Baseball Savant)
- Injuries (via FanGraphs)

Links

For more technical details of the project, visit the GitHub Repository.

Pete Berryman
