My thanks to mc79hockey for the retweet of the link to my first post on the Oilers Hackathon contest. It has been a while since this site has had a bunch of folks stop by.
Over the last couple of days, I tackled the first of the four questions that make up the contest.
- Predict next regular season’s points/game for the players listed in appendix A.
My interpretation of the confidentiality acknowledgement precludes me from listing the players in appendix A. In my original draft I had characterized the list of players and wagered a guess on what the Oilers were trying to accomplish. My lovely wife, holder of a law degree, kindly informed me that I ought to reconsider.
So I will leave that out for now. Should Appendix A end up in the public domain, I will edit this post and add those two cents.
For better or worse, I decided to make use of the Oilers data when applicable. I could not find the data on a couple of established NHL players and used hockeydb.
After pulling the initial data, I broke the players into forwards and defensemen. I then made a general assumption on an NHL player’s peak age for point production: age 24 for forwards, age 26 for defensemen. As a check, I did a quick search and found that Gabe Desjardins put out an analysis suggesting age 25 is roughly the peak for production.
For those players below peak age, I simply took their totals from last year and increased them by 5%. That assumption is a tad conservative compared to Gabe’s graph.
For any players without NHL experience, I again relied on Gabe and his NHL equivalency numbers to convert their non-NHL numbers to an NHL estimate for the upcoming year.
Lastly, I made assumptions for those players beyond their peak. I started with a gut feeling of an annual decrease in productivity of 20% for forwards and 15% for defensemen. Gabe’s analysis suggests something closer to 5% at age 28 and 29. I revised my figures to 10% and 7.5% respectively, on the assumption the decline gets steeper as individuals age.
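Here is a rough sketch of the adjustment rules described above, written in Python. The function and constant names are made up for illustration, and two things are assumptions on top of what is described: a player exactly at peak age is held flat, and "age" means the player's age for the upcoming season. Players without NHL experience are left out because the equivalency factors are not listed here.

```python
# Peak ages and adjustment rates taken from the text above.
PEAK_AGE = {"F": 24, "D": 26}            # forwards, defensemen
ANNUAL_DECLINE = {"F": 0.10, "D": 0.075}  # revised decline rates past peak
GROWTH_BELOW_PEAK = 0.05                  # 5% bump for players below peak


def project_ppg(position, age, last_ppg):
    """Project next season's points/game from last season's figure.

    position: "F" or "D" (hypothetical labels).
    age: player's age for the upcoming season (assumption).
    last_ppg: last season's points per game.
    """
    peak = PEAK_AGE[position]
    if age < peak:
        return last_ppg * (1 + GROWTH_BELOW_PEAK)
    if age > peak:
        return last_ppg * (1 - ANNUAL_DECLINE[position])
    # Assumption: no adjustment for a player exactly at peak age.
    return last_ppg


if __name__ == "__main__":
    print(project_ppg("F", 22, 0.50))  # young forward, bumped up 5%
    print(project_ppg("D", 30, 0.40))  # veteran defenseman, cut 7.5%
```

Swapping the values in ANNUAL_DECLINE for 0.20 and 0.15 gives the original gut-feeling version of the estimates.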
As I was wrapping up my work on this question, I decided to stress test my predictions against the judging criteria. If my goal is simply to collect as many points as possible, I could justify going back to the original 20% and 15% decline rates.
The first three questions will be submitted via web form before the Contest Closing Time. Results will be judged using the mean absolute percentage error (MAPE) of the estimates versus the season actuals. The Entrant with the lowest MAPE will receive the full twenty points and the worst MAPE will receive zero points. The points of other Entrants will be distributed proportionately between zero and twenty.
Here is the definition from Wikipedia:
The mean absolute percentage error (MAPE), also known as mean absolute percentage deviation (MAPD), is a measure of accuracy of a method for constructing fitted time series values in statistics, specifically in trend estimation. It usually expresses accuracy as a percentage, and is defined by the formula:
$$M = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{A_t - F_t}{A_t} \right|$$
where $A_t$ is the actual value and $F_t$ is the forecast value.
The difference between $A_t$ and $F_t$ is divided by the actual value $A_t$ again. The absolute value in this calculation is summed for every fitted or forecasted point in time and divided again by the number of fitted points $n$. Multiplying by 100 makes it a percentage error.
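To make the definition concrete, here is a minimal Python sketch of the calculation. The function name and the sample numbers are mine, not the contest's, and I am assuming each player counts as one "point in time" in the sum. Note that the formula is undefined for any player whose actual value is zero.

```python
def mape(actuals, forecasts):
    """Mean absolute percentage error, in percent.

    actuals and forecasts are equal-length sequences of numbers.
    Raises ValueError on empty or mismatched input, and
    ZeroDivisionError if any actual value is zero.
    """
    if len(actuals) != len(forecasts) or len(actuals) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs((a - f) / a) for a, f in zip(actuals, forecasts))
    return 100.0 * total / len(actuals)


if __name__ == "__main__":
    # Made-up points/game figures for three players.
    actual = [0.80, 0.50, 0.25]
    forecast = [0.72, 0.55, 0.25]
    print(round(mape(actual, forecast), 2))
```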
The Oilers are judging the predictions on the average rather than looking at how I did player by player.
Simple example: I estimate 0.5 points per game and the actual comes in 25% lower or higher. If the actual is 0.375 (I overstated), the error is |0.375 - 0.5| / 0.375 = 33%. If the actual is 0.625 (I understated), the error is |0.625 - 0.5| / 0.625 = 20%.
For question 1, judging using the MAPE should cause a bias towards underestimating the point totals if possible. The reason is that, all else equal, overstating by a given amount results in a larger MAPE than understating by the same amount, because the miss is divided by a smaller actual value.
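A quick way to see the size of that bias is to score a few candidate estimates against the same uncertain outcome. The sketch below is a toy setup, not the contest's scoring code: it assumes a single player whose actual points/game is equally likely to land at 0.375 or 0.625, and the helper names are made up.

```python
def ape(actual, estimate):
    """Absolute percentage error for one player, in percent."""
    return abs(actual - estimate) / actual * 100.0


def average_ape(estimate, possible_actuals):
    """Average error over equally likely outcomes (toy assumption)."""
    errors = [ape(a, estimate) for a in possible_actuals]
    return sum(errors) / len(errors)


# Assumption: the actual is 0.5 points/game plus or minus 25%, 50/50.
possible_actuals = [0.375, 0.625]
candidates = [0.375, 0.5, 0.625]

scores = {c: average_ape(c, possible_actuals) for c in candidates}
best = min(scores, key=scores.get)

for c in candidates:
    print(f"estimate {c:.3f}: average error {scores[c]:.1f}%")
print(f"lowest average error at estimate {best:.3f}")
```

Under these assumptions the low estimate of 0.375 averages a 20.0% error, against 26.7% for the middle estimate of 0.5 and 33.3% for the high one, so shading the estimate down is rewarded even though the outcomes are symmetric around 0.5.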
I have not decided if I will adjust my estimates from the 10% and 7.5% decline. Part of my brain wants to simply come up with the best estimate possible. The rest of my brain loves the idea of working back into an estimate that takes advantage of the judging criteria.
Fortunately for questions 2 and 3, I think my best estimate will be optimal in minimizing MAPE. I will get started on the next question tomorrow and write about it later in the week.
DECEMBER 27TH EDIT. THE FOLLOWING IS A BIT MORE INFORMATION ABOUT THE NATURE OF APPENDIX A. IT WAS CUT FROM MY ORIGINAL POST.
As part of my analysis, I used the Oilers data to pull out the points/game figures for each player for the past three years. I will detail my methodology in a moment, but I pulled each player’s age as well. I noted a pretty narrow range of ages amongst the players, with players falling in roughly four categories:
- Current Oilers
- Prospective free agents
- Prospective restricted free agents
- Prospective veteran free agents (aka old guys)
My guess is the Oilers would like to predict points/game figures as part of their evaluation of potential free agent signings. Before we all laugh at the Oilers doing this, please note a) I could be wrong and b) we don’t know what else they do as part of their evaluation.