ballmart

Summary

Ballmart is my end-to-end college basketball analysis project. This project began when I was first learning R and continues today, more than 9 years later. It has had many iterations and deployments!

The project is inspired by Ken Pomeroy’s work on college basketball. At first, I was just curious to see if I could recreate his ratings system. This quickly evolved into wanting to make small tweaks to the system myself. I found that this project gave me a greater appreciation for college basketball and sports analytics in general.

An Abridged Chronological History

2015-2016: I wanted to make an archive of kenpom ratings, so I used his webpage to learn scraping in R. Once I could pull the data, I started using it to predict the scores and outcomes of the coming day's games. This was awesome: I was living in the dormitories at the time, watching a lot of games, and seeing how performance on the court translated into changes in the ratings.
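For illustration, here is a minimal sketch of that first scraping step. The original was written in R; this Python version assumes the front-page ratings are still a plain HTML table and that a browser-like User-Agent is enough to fetch the page.

    from io import StringIO

    import pandas as pd
    import requests

    # Assumption: the ratings on the kenpom front page are a plain HTML table.
    resp = requests.get(
        "https://kenpom.com/",
        headers={"User-Agent": "Mozilla/5.0"},  # in case the default agent is rejected
        timeout=30,
    )
    resp.raise_for_status()

    ratings = pd.read_html(StringIO(resp.text))[0]  # first table on the page
    print(ratings.head())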

2017: I was looking to create my own ratings system, which required pulling my own data. I built an API wrapper for the NCAA statistics website and began automating my data gathering. This resulted in a lot of data! At this point, I was content storing the data in CSV files. This exercise taught me about cleaning raw source-system data, how not to "maintain a database" (a.k.a. CSV files), and cron scheduling.
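The shape of that workflow, sketched in Python with a stand-in endpoint, parameters, and file path (the real wrapper targeted the NCAA statistics website), plus the cron entry that did the scheduling:

    import csv
    import datetime as dt

    import requests

    STATS_URL = "https://stats.example.org/team_stats"  # stand-in endpoint

    def pull_daily_stats(day: dt.date) -> None:
        """Fetch one day's team stats and write them to a dated CSV 'database'."""
        # Assume the response parses to a list of flat dicts, one per team.
        rows = requests.get(STATS_URL, params={"date": day.isoformat()}, timeout=30).json()
        with open(f"data/team_stats_{day:%Y%m%d}.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=sorted(rows[0]))
            writer.writeheader()
            writer.writerows(rows)

    if __name__ == "__main__":
        pull_daily_stats(dt.date.today())

    # Scheduled with cron, e.g. every morning at 6:00:
    # 0 6 * * * /usr/bin/python3 /home/me/ballmart/pull_daily_stats.py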

2018: I was finally able to determine how the kenpom ratings were being generated and started to make them myself (or at least got close to how he is/was doing it; I believe there is some special sauce that he does not include in his blog). I began tinkering with some aspects of the ratings: home-court advantage, how performance shifts as the season goes on. At this point, I had lots of historical data on which I could backtest changes.
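The backbone of those ratings is tempo-free efficiency: estimate possessions from the box score, measure points scored and allowed per 100 possessions, and then adjust for opponent strength. Here is a sketch of the commonly cited possession approximation, as my reconstruction rather than his exact method:

    def possessions(fga: int, orb: int, tov: int, fta: int) -> float:
        """Estimate possessions from box-score counts (0.475 is the usual FTA weight)."""
        return fga - orb + tov + 0.475 * fta

    def efficiency(points: int, poss: float) -> float:
        """Points scored (or allowed) per 100 possessions."""
        return 100 * points / poss

    # Example: 78 points on roughly 68 possessions
    poss = possessions(fga=58, orb=9, tov=11, fta=18)  # ~68.6
    print(round(efficiency(78, poss), 1))              # ~113.8 points per 100 possessions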

2019: The database era. I repurposed an old desktop computer that I had: installed Ubuntu Server, installed Postgres on it, and uploaded all of the data that I had gathered to that point. I spent a lot of time applying what I had learned in the database management classes I took at the University of Arkansas, such as normalization and data typing. Takeaways: it was nice to have a single source of data that I could connect to at any time (instead of the CSV "databases" of years past); however, I learned very quickly that the data I had been gathering was not as clean as I had originally thought.
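For flavor, a simplified, hypothetical two-table slice of what that normalization looked like, run through psycopg2 just to keep these examples in one language (the host and database names are made up):

    import psycopg2

    # Hypothetical slice of the schema: teams in one table, games referencing them
    # by key, with an explicit data type on every column.
    DDL = """
    CREATE TABLE IF NOT EXISTS teams (
        team_id integer PRIMARY KEY,
        name    text NOT NULL
    );
    CREATE TABLE IF NOT EXISTS games (
        game_id      integer PRIMARY KEY,
        game_date    date NOT NULL,
        home_team_id integer NOT NULL REFERENCES teams (team_id),
        away_team_id integer NOT NULL REFERENCES teams (team_id),
        home_score   smallint,
        away_score   smallint
    );
    """

    conn = psycopg2.connect("dbname=ballmart host=192.168.0.10")  # made-up DSN
    with conn, conn.cursor() as cur:
        cur.execute(DDL)
    conn.close()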

2020: I was in my Master's program and therefore did not dedicate much time to improving the project; additionally, this was during the COVID-19 pandemic, which resulted in no 2020 March Madness and an altered start to the 2020-21 season. I did, however, spend this time working on a CS:GO esports project that involved a similar process: scraping, inserting data into the Postgres database, then analyzing within R.

2021: I found an R package called ncaahoopR that provides an API wrapper for ESPN college basketball data. I began storing the data gathered with this tool alongside the data from the NCAA statistics website within my Postgres database. The data coming from this tool was slightly cleaner, albeit less normalized, than what my own processing of the NCAA stats website data had produced. After validating the new data, I began relying on it more.

2022: A philosophy shift. This was my second year formally working for a company and doing data work every day. My main takeaways from that work were to a) not fight the data source, b) use the tools that are best suited to the job at hand, and c) make those tools work together in a predictable, modular fashion. For this project, the shift took the following shape:

a) Instead of scraping data and cleaning it within R or Python before inserting it into the database, take the JSON document returned by the API call and store it, in its entirety, within the database (fortunately, Postgres has some cool JSON operations and data types). A sketch of this pattern follows below.

b) Use dbt to maintain the SQL models within the database; this allows for version control of the underlying SQL (both DDL and DML), visibility into what is happening to the raw data, and control over which processes run and when they run.

c) Take the modular processes and have a tool (dagster) orchestrate the entire pipeline and the handoffs between the different tools.
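As a concrete (if simplified) sketch of the (a) and (c) pattern: a dagster asset that pulls one JSON response and stores it whole in a jsonb column. The source URL, table name, and connection string are stand-ins.

    import psycopg2
    import requests
    from dagster import asset
    from psycopg2.extras import Json

    SOURCE_URL = "https://api.example.com/scoreboard"  # stand-in for any of the sources

    @asset
    def raw_scoreboard() -> None:
        """Store one API response, untouched, as a single jsonb row."""
        payload = requests.get(SOURCE_URL, timeout=30).json()
        conn = psycopg2.connect("dbname=ballmart host=192.168.0.10")  # made-up DSN
        with conn, conn.cursor() as cur:
            # Raw table is roughly (loaded_at timestamptz DEFAULT now(), payload jsonb);
            # downstream dbt models unpack it with Postgres JSON operators,
            # e.g. payload -> 'events' or payload ->> 'date'.
            cur.execute(
                "INSERT INTO raw.scoreboard (payload) VALUES (%s)",
                [Json(payload)],
            )
        conn.close()

dagster then schedules assets like this alongside the dbt models, so each day's run is one orchestrated graph rather than a pile of disconnected scripts.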

This was exciting: it was my first foray into using open-source tools and integrating them with the code that I had been working on. I also "onboarded" a couple of new data sources: NatStat and The Odds API. I took data from those two sources, created a light wrapper for the ESPN API, and ultimately began storing raw JSON from all three sources within the Postgres database.
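As an example of what "light wrapper" means here, a sketch of the ESPN piece. The unofficial endpoint and its parameters are written from memory and may have drifted, so treat them as an assumption.

    import datetime as dt

    import requests

    # Unofficial scoreboard endpoint commonly used by community packages; the exact
    # path and parameters are an assumption and may have changed.
    ESPN_SCOREBOARD = (
        "https://site.api.espn.com/apis/site/v2/sports/"
        "basketball/mens-college-basketball/scoreboard"
    )

    def get_scoreboard(day: dt.date) -> dict:
        """Return the raw scoreboard JSON for a single day."""
        resp = requests.get(
            ESPN_SCOREBOARD,
            params={"dates": day.strftime("%Y%m%d"), "limit": 500},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()

    # The pipeline stores get_scoreboard(dt.date.today()) as-is, per the pattern above.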

I would be remiss if I did not mention that my friends have been an instrumental part of this ballmart project: they always bring up good ideas and want to be part of the process, whether that is discussing ideas for the future or pushing me to think about something differently than I had before. dagster allowed me to pay them back, in a way: I started a Discord server and created a bot that posted the daily predictions each morning. This ended up being a very fun and private way to engage the people who contributed to the project.
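I built it as a proper bot, but the core of the daily post is simple; a webhook version of the same idea, with a placeholder URL and placeholder prediction lines, looks like this:

    import requests

    # Placeholder webhook URL for the channel the predictions go to.
    WEBHOOK_URL = "https://discord.com/api/webhooks/<id>/<token>"

    def post_predictions(lines: list[str]) -> None:
        """Post the morning's predicted scores as one Discord message."""
        message = "Today's predictions:\n" + "\n".join(lines)
        resp = requests.post(WEBHOOK_URL, json={"content": message}, timeout=30)
        resp.raise_for_status()

    # Placeholder lines; the real ones come from the model's daily run.
    post_predictions(["Team A 78, Team B 74", "Team C 69, Team D 66"])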

2023: I used this website (technically a subdomain) to publish what had previously been going to the Discord server. Unfortunately, due to some life events, I did not spend much time writing or conducting analysis. Fortunately, the data pipelines built with dbt and dagster allowed that website to run completely hands-free!

2024 (upcoming): I am evaluating some different tools to enable player-level analysis. Since my database server is an old dual-core computer, I have run into some fun limitations on the analysis I can (realistically) conduct during the season. I am planning to try an OLAP database engine such as ClickHouse. Another route I am interested in pursuing is maintaining a data lakehouse on an open table format like Iceberg; this would let my choice of compute engine be a tad more agnostic to how and where the data is stored. Regardless, I am hoping that this backend work enables some more fun end products and use cases!