PJB Sports Data
Introduction
Much of the MLB research I have done in recent years has involved loading Statcast data from Baseball Savant. This publicly available data provides an incredible opportunity to conduct in-depth studies of what we see happening on the diamond. However, when working with this data, I found that I was left with two options:
- Query all the Statcast pitches from a given time range
  - Examples (see the sketch after this list):
    - Python: pybaseball’s statcast function
    - R: baseballr’s statcast_search function
  - Pros:
    - Once this data is loaded, it is extremely detailed
    - There are many analysis options from here
  - Cons:
    - Slow: getting all the pitches from 5 days of games takes almost a minute
    - Risks bogging down Baseball Savant’s servers
- Load aggregated data
  - Pros:
    - Faster
  - Cons:
    - It would only answer one question; new data would have to be queried to address other questions
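As a concrete illustration of the first option, here is a minimal sketch (assuming pybaseball is installed; the dates and columns shown are placeholders) of pulling five days of pitch-level data:

```python
# Option 1 in practice: pull every Statcast pitch for a date range.
# The dates here are arbitrary placeholders.
from pybaseball import statcast

# Five days of games -- a single call like this can take close to a
# minute, because it downloads every pitch from Baseball Savant.
pitches = statcast(start_dt="2024-04-01", end_dt="2024-04-05")

# The result is a pandas DataFrame with one row per pitch and dozens
# of columns (pitch type, release speed, launch angle, and so on).
print(pitches.shape)
print(pitches[["game_date", "pitch_type", "release_speed"]].head())
```

baseballr’s statcast_search works much the same way from R.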
In the end, the point of this project is to knock out a lot of the data engineering work that inevitably has to happen with most analytics projects. Hopefully, I will be able to spend more of my time analyzing the data and less of it loading and cleaning it.
Central Data Warehouse
With this in mind, I decided that I would build out my own sports data warehouse where I could…
- Efficiently query data from tables
- Join and map tables in the warehouse before loading results into memory
- Avoid bogging down external sources by querying daily/weekly/etc. to update my data warehouse (see the update sketch after this list)
- Build scripts and dashboards to better understand data
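To make the daily/weekly update idea concrete, here is a minimal sketch of an incremental load, assuming pybaseball for the source data and pandas-gbq for the BigQuery round trip; the project and table names are hypothetical placeholders:

```python
# Sketch of a scheduled update: pull only the pitches newer than what
# is already in the warehouse and append them to a BigQuery table.
from datetime import date, timedelta

import pandas_gbq
from pybaseball import statcast

PROJECT_ID = "my-gcp-project"   # placeholder GCP project
TABLE_ID = "statcast.pitches"   # placeholder dataset.table

# Find the most recent game date already loaded.
last_loaded = pandas_gbq.read_gbq(
    f"SELECT MAX(game_date) AS max_date FROM `{TABLE_ID}`",
    project_id=PROJECT_ID,
)["max_date"].iloc[0]

# Query Baseball Savant only for the missing days.
start = (last_loaded + timedelta(days=1)).strftime("%Y-%m-%d")
end = date.today().strftime("%Y-%m-%d")
new_pitches = statcast(start_dt=start, end_dt=end)

# Append the new rows so the external source is only hit once per run.
if not new_pitches.empty:
    pandas_gbq.to_gbq(
        new_pitches, TABLE_ID, project_id=PROJECT_ID, if_exists="append"
    )
```

Run on a daily or weekly schedule (cron, Cloud Scheduler, etc.), a script like this keeps the warehouse current while only touching Baseball Savant once per run.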
BigQuery
I chose Google BigQuery as a data warehouse because…
- 10 GB free storage each month
- 1 TB free queries each month
- SQL!
- Google Cloud is a valuable skill in the data engineering world
- Easy connection to Python, R, etc. (see the sketch after this list)
- Easier setup than Oracle, in my opinion
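To illustrate the easy connection from Python, here is a small sketch using the official google-cloud-bigquery client; the project and table names are placeholders for whatever lives in the warehouse:

```python
# Querying the warehouse from Python with google-cloud-bigquery.
# Project and table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

query = """
    SELECT pitch_type,
           AVG(release_speed) AS avg_velo,
           COUNT(*) AS pitches
    FROM `my-gcp-project.statcast.pitches`
    WHERE game_date >= '2024-04-01'
    GROUP BY pitch_type
    ORDER BY pitches DESC
"""

# to_dataframe() hands the result back as a pandas DataFrame, so it
# drops straight into the usual Python analysis workflow.
df = client.query(query).to_dataframe()
print(df)
```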
Current Data Sources
- Statcast pitch-by-pitch data from Baseball Savant
Links
For more technical details of the project, visit the GitHub Repository.