PJB Sports Data

Introduction

Much of the MLB research I have done in recent years has involved loading Statcast data from Baseball Savant. This publicly available data provides an incredible opportunity to conduct in-depth studies of what we see happening on the diamond. However, I found that I was left with two options when working with this data:

  1. Query all the Statcast pitches from a given time range
    Pros:
      • Once this data is loaded, it is extremely detailed, and there are many analysis options from there
    Cons:
      • Slow: getting all the pitches from 5 days of games takes almost a minute
      • Risks bogging down Baseball Savant’s servers
  2. Load aggregated data
    Pros:
      • Faster
    Cons:
      • Each aggregated dataset only answers one question, so new data would have to be queried to address other questions

In the end, the point of this project is to knock out a lot of the data engineering work that inevitably has to happen with most analytics projects. Hopefully, I will be able to

Central Data Warehouse

With this in mind, I decided that I would build out my own sports data warehouse where I could…

  1. Efficiently query data from tables
  2. Map tables before loading to memory
  3. Avoid bogging down external sources by querying daily/weekly/etc. to update my data warehouse
  4. Build scripts and dashboards to better understand data
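
To make the third point concrete, here is a minimal sketch of how a scheduled update could work out which date ranges still need to be pulled from the external source. The function name `pending_windows`, the 5-day chunk size, and the choice to stop at yesterday are all assumptions for illustration, not details of the actual project.

```python
from datetime import date, timedelta


def pending_windows(last_loaded: date, today: date, chunk_days: int = 5):
    """Return (start, end) date pairs covering every day after last_loaded
    up to and including yesterday, in chunks of at most chunk_days days."""
    if chunk_days < 1:
        raise ValueError("chunk_days must be at least 1")
    windows = []
    start = last_loaded + timedelta(days=1)
    # Assumption: skip today, since games may still be in progress.
    final_day = today - timedelta(days=1)
    while start <= final_day:
        end = min(start + timedelta(days=chunk_days - 1), final_day)
        windows.append((start, end))
        start = end + timedelta(days=1)
    return windows


if __name__ == "__main__":
    # Hypothetical dates: the warehouse holds data through April 1.
    for start, end in pending_windows(date(2023, 4, 1), date(2023, 4, 13)):
        print(start.isoformat(), "to", end.isoformat())
```

Each window is small enough to request on its own, so the external source only sees short, infrequent queries, and everything else is answered from the warehouse.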

BigQuery

I chose Google BigQuery as a data warehouse because…

  • 10 GB of free storage each month
  • 1 TB of free query processing each month
  • SQL!
  • Google Cloud is a valuable skill in the data engineering world
  • Easy connection to Python, R, etc.
  • Easy setup compared to Oracle, in my personal opinion.
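
As a rough illustration of the Python connection, the sketch below aggregates pitch-level data inside BigQuery and loads only the summary into memory. It assumes the `google-cloud-bigquery` package is installed and credentials are configured. The table ID passed in is hypothetical, and the column names (`pitch_type`, `release_speed`, `game_date`) are borrowed from the Statcast export, with `game_date` assumed to be stored as a DATE. The real warehouse schema may differ.

```python
from datetime import date


def build_pitch_summary_sql(table_id: str) -> str:
    """Build a query that summarizes pitches by type over a date range.

    The dates are passed as named query parameters; the table ID cannot be
    parameterized, so it is formatted into the string.
    """
    return f"""
        SELECT
          pitch_type,
          COUNT(*) AS pitches,
          AVG(release_speed) AS avg_release_speed
        FROM `{table_id}`
        WHERE game_date BETWEEN @start_date AND @end_date
        GROUP BY pitch_type
        ORDER BY pitches DESC
    """


def fetch_pitch_summary(table_id: str, start_date: date, end_date: date):
    """Run the summary query and return the result as a pandas DataFrame."""
    # Imported here so the SQL builder above works without the library.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses the default project and credentials
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("start_date", "DATE", start_date),
            bigquery.ScalarQueryParameter("end_date", "DATE", end_date),
        ]
    )
    sql = build_pitch_summary_sql(table_id)
    return client.query(sql, job_config=job_config).to_dataframe()
```

A call might look like `fetch_pitch_summary("my-project.statcast.pitches", date(2023, 4, 1), date(2023, 4, 5))`, where the project, dataset, and table names are placeholders.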

Current Data Sources

For more technical details of the project, visit the GitHub Repository.
