Over three years ago I started building a model to predict daily fantasy baseball scores. Over the first year and a half, I did a ton of research to figure out which statistics are actually useful for prediction. I also built an ETL pipeline that scraped data from the MLB and ultimately produced a dataset that could be used for creating a model.
Building the predictive model is ultimately where I failed. I tried using regression to build a model, but I was using it incorrectly. On top of that, my lineup selection was a manual process where I used my janky regression model for assistance.
Not surprisingly, when I put my model into practice at the beginning of the 2015 season, my performance wasn't great. I didn't keep a detailed log of my entries, but I played almost every day for about two months, betting $5 to $10 each day, and over that stretch I lost about $70.
Fast forward a year, and I had started my graduate studies in analytics at Georgia Tech. In one of my classes, CSE 6242 (Data and Visual Analytics), I had the opportunity to create a group project and decided to continue working on my daily fantasy baseball model.
After taking over a year off, it was surprisingly easy to get the ETL pipeline working again. Good job past me! I decided to stick with a linear regression model because I was taking a regression class at the time and was comfortable creating a robust regression model.
We ended up creating two separate models, one for pitching and one for batting. After creating predictions for each player on each day of the season, the projections were run through linear optimization to select the lineup with the highest projected score that would fit under the salary cap ($50,000 on DraftKings).
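To make the lineup selection step concrete, here is a minimal sketch of the same idea: maximize total projected points subject to a salary cap and a fixed number of slots per position. It is not the project's actual code. It swaps the linear optimization solver for a brute-force search, which only works because the slate is tiny, and the player names, salaries, projections, and the simplified `ROSTER` are all made up for illustration (the real DraftKings roster has more positions).

```python
from itertools import combinations, product

SALARY_CAP = 50_000  # DraftKings cap quoted in the post

# Hypothetical toy slate: (name, position, salary, projected_points).
PLAYERS = [
    ("Pitcher A", "P", 12_000, 25.0),
    ("Pitcher B", "P", 10_000, 21.0),
    ("Pitcher C", "P", 7_000, 15.0),
    ("Infielder A", "IF", 9_000, 12.0),
    ("Infielder B", "IF", 8_000, 10.5),
    ("Infielder C", "IF", 5_000, 7.0),
    ("Outfielder A", "OF", 9_500, 13.0),
    ("Outfielder B", "OF", 7_500, 10.0),
    ("Outfielder C", "OF", 4_500, 6.5),
]

# Simplified roster (an assumption): number of slots to fill per position.
ROSTER = {"P": 2, "IF": 2, "OF": 2}


def best_lineup(players, roster, salary_cap):
    """Return (points, salary, lineup) for the best lineup under the cap, or None."""
    # Group players by position.
    by_position = {}
    for player in players:
        by_position.setdefault(player[1], []).append(player)

    # Every way to fill each position's slots.
    slot_choices = [
        list(combinations(by_position.get(position, []), count))
        for position, count in roster.items()
    ]

    best = None
    for choice in product(*slot_choices):
        lineup = [player for group in choice for player in group]
        salary = sum(player[2] for player in lineup)
        if salary > salary_cap:
            continue  # violates the salary cap constraint
        points = sum(player[3] for player in lineup)
        if best is None or points > best[0]:
            best = (points, salary, lineup)
    return best


points, salary, lineup = best_lineup(PLAYERS, ROSTER, SALARY_CAP)
print(f"Projected points: {points}, salary used: {salary}")
for name, position, player_salary, projection in lineup:
    print(f"  {position:<3} {name:<13} {player_salary:>6} {projection:>5}")
```

On this toy slate the two highest-projected players at every position cost more than the cap allows, so the search has to trade down somewhere. On a full slate the number of combinations explodes, which is why a proper solver makes sense there.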
The results were surprisingly good for such a simple model and approach. On average, the lineups selected scored 138 points. For comparison, the average score in head-to-head matchups on DraftKings is 100.2, so we performed well above the average.
The output of the model is visualized here: http://faseball.herokuapp.com (note that it may take 10-15 seconds on initial load. Heroku free tier lol). You can select a date to view the lineup that the model picked. The table is sortable by column, and the sunburst visualization can be switched to show some different values.
Some takeaways and lessons learned:
- It was surprisingly hard to find features with predictive power. At this point, I have some ideas for features that may improve the model but I'm guessing the improvements would be minimal (if any).
- There were many features that I thought would surely be predictive that weren't. The most surprising one to me was in the pitching model: the opposing starting batting lineup had no correlation at all.
- The batting model had a very low R^2 value, but the model was still good enough to score well above average. I think this has to do with the fact that players are picked by who will perform the best that day, not on how accurate the prediction is.
- We created a random forest model for batting prediction, but it surprisingly performed worse than the linear regression model.
- Our model didn't have many features but still performed well.
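
The low R^2 takeaway above can be illustrated with a toy simulation. This is not our data. It is a made-up setup where the prediction is a weak signal buried in a lot of noise, with scores in arbitrary units centered on zero. The values of `N_PLAYERS`, `TOP_K`, and the noise level are all assumptions picked to give an R^2 of roughly 0.1.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

N_PLAYERS = 2000  # hypothetical number of player-days
TOP_K = 200       # hypothetical number of players we "pick"

# Assumed setup: the actual score is the prediction plus large random noise.
predicted = [random.gauss(0, 1) for _ in range(N_PLAYERS)]
actual = [p + random.gauss(0, 3) for p in predicted]


def r_squared(xs, ys):
    """Squared Pearson correlation, which equals R^2 for a one-feature linear fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


r2 = r_squared(predicted, actual)

# Pick the players with the highest predictions and see how they actually did.
ranked = sorted(range(N_PLAYERS), key=lambda i: predicted[i], reverse=True)
picked_mean = sum(actual[i] for i in ranked[:TOP_K]) / TOP_K
overall_mean = sum(actual) / N_PLAYERS

print(f"R^2: {r2:.3f}")
print(f"Average actual score, picked players: {picked_mean:.2f}")
print(f"Average actual score, all players:    {overall_mean:.2f}")
```

Even though the predictions explain only a small share of the variance, the players ranked highest by the model still do better than the field on average, which is all a lineup picker needs.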