Methodology: How we built The Forecheck, our home for NHL predictions
Welcome to The Forecheck, The Globe and Mail's home for 2017-18 NHL predictions.
Developed by Globe journalists and data scientists, The Forecheck ranks among the best models in predicting NHL games. It's also a work in progress that we'll update as necessary. Here's how the model works and how it's evolved.
Update (Jan. 25, 2018): We made several changes to The Forecheck's methodology and functionality, thanks to reader feedback. Here's an overview of the changes:
- Readers can now compare our past predictions with actual results. They can also see our accuracy rate for individual teams by clicking on a team in the playoff table.
- Our predictive process now runs several times a day, rather than once every morning, to account for last-minute roster changes.
- We introduced Markov chain Monte Carlo simulations to our game model. Read more about it here.
- We now include overtime losses in our points projections. Read more about it here.
- Due to an implementation error relating to how we processed new roster data, some game probabilities were retroactively changed, which meant our old model’s overall accuracy figure was incorrect. This has been fixed. Read more about our accuracy figures here.
What exactly are you predicting?
All game outcomes, along with each team's chances of making the playoffs.
How often are your predictions updated?
Several times a day. The process starts at about 4 a.m. and runs until the start of the first game of the day. At that point, the predictions are "locked in" until the process starts again the following day, when new game data are available.
How does this process actually work?
Each time the process runs, new data are fed into our game model, which then generates new predictions for all remaining games in the season.
What data are fed into your game model?
Roughly 2,000 team and player variables are used, including data from current and past seasons. (Data are provided by Sportradar.) Not every variable is equally weighted, mind you. The model assigns varying weights through machine learning, without human intervention. Essentially, the model will find that some variables have stronger correlations with game outcomes than others, and weight those variables accordingly. For instance, Connor McDavid's performance should have a greater bearing on the Edmonton Oilers' success than a fourth-line winger's. The model will account for that. Plus, more recent data should have a greater bearing on a team's success than events from years ago. The model will account for that, too.
What type of model do you use for game predictions and why?
The Globe's data-science team tested several models using data from the past three decades of NHL seasons. These included decision-tree, logistic-regression, conditional-random-field and Elo-based models. Their accuracy ranged from about 56 per cent to 59 per cent for game-by-game predictions. Ultimately, the most accurate model was a multilayer perceptron, which is a feed-forward artificial neural network.
The first version of the model was susceptible to streakiness. Thus, if a team was on an extended winning or losing streak, the model thought that trend would last for the foreseeable future. To address this bias, we added a Markov chain Monte Carlo method (MCMC for short) to our game model.
Markov chain… Monte… Carlo?
Basically, it allows us to simulate all remaining games in the season 10 times. Eventually, we hope to increase the number of simulations, but we're limited by the amount of processing power required to run them. (Right now, these simulations alone take almost 30 minutes on our servers.)
Here's how an MCMC simulation works: We draw a random number between 0 and 1 for each game and compare that figure to the home team's probability of winning the game in question. If the random number is equal to or smaller than the home team's win probability, the home team notches a win. If the random number is higher, the home team is given a loss. We then use the result as input for the team's future game predictions.
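The draw-and-compare step described above can be sketched in a few lines of Python. This is an illustration only, not The Forecheck's code: the function names, the three-game schedule and the flat 55-per-cent `toy_model` are all assumptions, with `toy_model` standing in for the real game model.

```python
import random


def simulate_remaining_games(games, predict_home_win_prob, rng):
    """Simulate each remaining game once, in schedule order.

    `games` is a list of (home, away) pairs. `predict_home_win_prob` is a
    hypothetical stand-in for the game model: it receives the matchup and
    the results simulated so far, and returns the home team's win probability.
    """
    results = []
    for home, away in games:
        p_home = predict_home_win_prob(home, away, results)
        draw = rng.random()  # uniform random number in [0, 1)
        home_won = draw <= p_home  # equal or smaller means a home win
        # The simulated result becomes input for later predictions.
        results.append((home, away, home_won))
    return results


def toy_model(home, away, results):
    """Placeholder predictor: a flat 55 per cent for every home team."""
    return 0.55


# Hypothetical three-game schedule.
schedule = [("TOR", "MTL"), ("EDM", "CGY"), ("MTL", "EDM")]

# Ten passes over the schedule, mirroring the 10 simulations mentioned above.
simulations = [
    simulate_remaining_games(schedule, toy_model, random.Random(seed))
    for seed in range(10)
]
```

A real predictor would read the `results` list to update each team's form before returning a probability; the placeholder ignores it to keep the sketch short.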
Ultimately, this process helps us mitigate our model's streakiness bias.
However, because we're only running 10 simulations, we also use a "voting system" that, for future games, takes a weighted average of probabilities from the previous five days. This helps to curb volatility in our game predictions.
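To make the five-day weighted average concrete, here is a rough Python sketch. The weights are an assumption chosen for illustration (newer days count more); the actual weights The Forecheck uses are not stated here, and the function name and sample probabilities are hypothetical.

```python
def smoothed_probability(daily_probs, weights=(1, 2, 3, 4, 5)):
    """Weighted average of a game's most recent daily win probabilities.

    `daily_probs` is ordered oldest to newest. The default weights are an
    illustrative assumption: the newest day counts five times as much as
    the oldest of the five.
    """
    if not daily_probs:
        raise ValueError("need at least one daily probability")
    recent = daily_probs[-len(weights):]
    # If fewer than five days are available, use only the newest weights.
    used_weights = weights[-len(recent):]
    weighted_sum = sum(p * w for p, w in zip(recent, used_weights))
    return weighted_sum / sum(used_weights)


# Hypothetical home-win probabilities for one future game over five days.
history = [0.50, 0.52, 0.60, 0.58, 0.55]
smoothed = smoothed_probability(history)
```

With these sample inputs the smoothed figure lands at about 0.56, between the day-to-day swings.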
You said the Leafs had a 55-per-cent chance of winning a game. Then they lost. What gives?
The Forecheck assigns probabilities to game outcomes and playoff chances. Thus, if the Toronto Maple Leafs have a 55-per-cent chance of beating the Montreal Canadiens, that doesn't mean the Leafs will win. It only means the Leafs have a better shot at prevailing, based on our model and the data available to that point. Put another way, a team given a 55-per-cent chance should still lose about 45 per cent of such games.
How do the playoff probabilities work?
Once again, we use an MCMC model. We take our final game predictions and simulate the remainder of the season 10,000 times. (Initially, we ran 100,000 simulations, but found 10,000 was enough to generate the figures we needed.) This allows us to generate a projected point total for all teams, and subsequently assess their playoff chances.
The first version of the model didn't award single points for overtime losses. Several readers wanted this feature, so we now distribute overtime loss points based on a team's past propensity to lose games in this fashion.
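The playoff simulation can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not The Forecheck's implementation: it uses a single standings table with a fixed number of playoff spots rather than the NHL's division and wild-card format, treats each team's overtime-loss rate as a fixed hypothetical number, and every name and figure in it is made up for the example.

```python
import random
from collections import Counter


def simulate_season(points, games, otl_rate, rng):
    """Play out the remaining games once and return final point totals."""
    totals = dict(points)
    for home, away, home_win_prob in games:
        if rng.random() <= home_win_prob:
            winner, loser = home, away
        else:
            winner, loser = away, home
        totals[winner] += 2
        # The loser earns one point if the simulated loss comes in overtime,
        # drawn from that team's past rate of losing this way.
        if rng.random() < otl_rate[loser]:
            totals[loser] += 1
    return totals


def playoff_chances(points, games, otl_rate, spots, n_sims=10_000, seed=0):
    """Share of simulated seasons in which each team finishes in the top `spots`."""
    rng = random.Random(seed)
    made = Counter()
    for _ in range(n_sims):
        totals = simulate_season(points, games, otl_rate, rng)
        # Ties are broken by the order of `points`; a simplification.
        ranked = sorted(totals, key=totals.get, reverse=True)
        made.update(ranked[:spots])
    return {team: made[team] / n_sims for team in points}


# Hypothetical standings, remaining games (home, away, home win probability)
# and overtime-loss rates.
standings = {"TOR": 60, "MTL": 48, "OTT": 45}
remaining = [("TOR", "MTL", 0.55), ("OTT", "TOR", 0.45), ("MTL", "OTT", 0.52)]
otl_rates = {"TOR": 0.25, "MTL": 0.20, "OTT": 0.22}
chances = playoff_chances(standings, remaining, otl_rates, spots=2)
```

Averaging the point totals across simulations, instead of counting playoff finishes, would give the projected point total for each team.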
How accurate is your game model?
Our new model has accurately predicted 62.9 per cent of games played this season, as of Jan. 24.
However, given updates to our model, we wanted to be as transparent as possible about our accuracy figures. For one, as mentioned in our update log, we miscalculated the accuracy figure from our old model, due to an error that retroactively changed some win-probability figures. We've fixed the bug and now "lock in" our final game predictions before game time.
As we made adjustments, we manually recorded our final predictions to compare with actual results. Between Jan. 16 and 24, the old model accurately predicted 50.8 per cent of 59 games played – not a fantastic rate, but justifiable given the small sample size.
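To put the small-sample caveat in numbers: 50.8 per cent of 59 games works out to 30 correct picks. A back-of-the-envelope sketch, assuming each game is an independent trial and using the normal approximation for a proportion (both simplifications, and the function name is hypothetical), shows how wide the uncertainty is.

```python
import math


def accuracy_interval(correct, played, z=1.96):
    """Approximate 95-per-cent interval for an accuracy rate.

    Uses the normal approximation to the binomial, which is only a rough
    guide at this sample size.
    """
    rate = correct / played
    std_error = math.sqrt(rate * (1 - rate) / played)
    return rate - z * std_error, rate + z * std_error


# 30 correct out of 59 games, i.e. the 50.8 per cent quoted above.
low, high = accuracy_interval(30, 59)
```

Under those assumptions the interval runs from roughly 38 to 64 per cent, so a nine-day stretch says little about a model's long-run accuracy.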
We now list three accuracy figures at the bottom of The Forecheck:
- the old model’s accuracy rate (50.8 per cent) from its short internet life;
- the new model’s season-to-date accuracy rate;
- and our accuracy rate since the new model was launched on Jan. 25.
As noted above, we also introduced new features that allow readers to see how our predictions fared in past games, along with our accuracy figures for particular teams.
Is 62.9-per-cent accuracy actually good? That sounds low.
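NHL games are notoriously difficult to predict. A team can dominate play in regulation, for instance, only to lose in a shootout. Against that backdrop, 62.9 per cent is considered extremely accurate. For comparison, the other models we tested on past seasons were right about 56 to 59 per cent of the time.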
Did you guys just get lucky?
Fortunately, no. Our logarithmic loss value for the current model (a rating of both the overall accuracy and confidence of a predictive model, where a lower number is better) sits at 0.6479, which is also considered exceptionally good. For reference, a model that gave every team a 50-per-cent chance in every game would score about 0.693.
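For readers curious how logarithmic loss is calculated, here is a minimal Python sketch of the standard formula for win-or-lose predictions. The function name and sample numbers are illustrative assumptions; they are not The Forecheck's data.

```python
import math


def log_loss(home_win_probs, home_won):
    """Average logarithmic loss for binary (win or lose) predictions.

    `home_win_probs` holds predicted home-win probabilities and `home_won`
    holds the actual results as 1 (home win) or 0 (home loss).
    """
    eps = 1e-15  # keeps log() away from exactly 0 or 1
    total = 0.0
    for p, y in zip(home_win_probs, home_won):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(home_win_probs)


# A coin-flip model scores ln(2), about 0.693, whatever the results are.
coin_flip = log_loss([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])

# Hypothetical predictions: being confident and right lowers the score,
# being confident and wrong raises it sharply.
confident_right = log_loss([0.9], [1])
confident_wrong = log_loss([0.9], [0])
```

The score punishes overconfidence: a wrong call made at 90 per cent costs far more than a wrong call made at 55 per cent.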
Have any further comments or questions? Email us at firstname.lastname@example.org