Predicting the 2021 PGA Championship Using a Monte Carlo Simulation


Recently, a major golf tournament, the PGA Championship, concluded. Astoundingly, pro golf crowned its oldest-ever major champion, Phil Mickelson. Prior to the tournament, partly motivated by my participation in a pool, I attempted to predict the tournament's outcome. This write-up is an explanation and walkthrough of my process. The code can be found here.

This tournament started with 156 players in the field, which makes accurately predicting a winner extremely difficult; golf also has a tremendous amount of variability in any given player's scores. A more attainable goal, I thought, was to figure out where players were most likely to place amongst their colleagues. Sure, I'd see who the model thought was going to win, but the model's projected winner should not be taken as a prediction that that player will actually win.

I gave myself a single day: the day just before the tournament started. The tournament ran from May 20-23: four rounds, with players needing to make a "cut" after their first two. As with any data science project, the first step was to develop a game plan and think through the best way to generate predictions.

Methodology and Data

Deriving the Core Metric

A PGA golf tournament is a competition amongst the best golfers from around the world. As such, a strong methodology would have to include some way to measure how a player performs relative to their competition. I figured a good way to do this was as follows:

  1. Gather a player's recent round score (total strokes)

  2. Measure the average score of the other players that played (average total strokes of field)

  3. Subtract the average total strokes from the player's total strokes to get a measure of how the player performed relative to the field

For example:

  1. Tiger Woods scored a 67

  2. The average player's score that day was a 72

  3. Tiger Woods had a score of -5 relative to the field

For any given player, we could take an arbitrary number of past rounds to use as an input to our model. I happened to choose 40 rounds, though I can see a good argument for using only more recent rounds.
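To make this concrete, here is a minimal sketch of that prep step in pandas. The input shape and column names (strokes, field_avg) are hypothetical, not the actual layout of the data I used:

import pandas as pd

# One row per player-round: the player's total strokes and the field's
# average total strokes for that same round (hypothetical column names)
rounds = pd.DataFrame({
    'player':    ['Tiger Woods', 'Tiger Woods', 'Phil Mickelson'],
    'strokes':   [67, 70, 69],
    'field_avg': [72.0, 71.5, 70.0],
})

# Score relative to the field, e.g. 67 - 72 = -5
rounds['rel_to_field'] = rounds['strokes'] - rounds['field_avg']

# Per-player mean and standard deviation of the metric; a summary
# table of this shape is what the simulation later draws from
df = rounds.groupby('player')['rel_to_field'].agg(['mean', 'std'])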

Data Source

Good data that can support accurate predictions is usually tough to find. Fortunately, I was able to get the data I needed to construct my metric from Advanced Sports Analytics.

The Monte Carlo Method

Now for the model itself. I thought a good solution was to use the Monte Carlo method and simulate "tournaments."

The Monte Carlo method is a catch-all term for algorithms that use repeated random sampling to solve complex problems with many inputs or parameters. It relies on drawing random samples to resolve the uncertainty around such problems, and it is used across disciplines like physics, biology, and finance. An elegant example of the Monte Carlo method in action is using it to estimate the value of pi, as sketched below.
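As a quick aside (separate from the tournament model), a minimal version of that pi estimate: sample points uniformly in the unit square and count the fraction landing inside the quarter circle, whose area is pi/4.

import numpy as np

def estimate_pi(n_points=1_000_000):
    rng = np.random.default_rng(42)
    x = rng.uniform(0, 1, n_points)
    y = rng.uniform(0, 1, n_points)
    inside = (x ** 2 + y ** 2) <= 1.0   # inside the quarter circle of radius 1
    return 4 * inside.mean()            # fraction inside ~ pi/4, so scale by 4

print(estimate_pi())  # ~3.141..., tightening as n_points grows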

I believe this problem suits the Monte Carlo method well, as there is a tremendous amount of uncertainty in its inputs: the final position of each player depends on the positions of all the other players. The Monte Carlo method allows us to deal with this feasibly.

For my purposes, I thought the best way to use the Monte Carlo method was to simulate scores relative to the field for each player (using the data described above). I draw four such scores per player (one for each round) and sum them, giving a random tournament score relative to the field. I do this for every player, then rank all the scores from least to greatest, so the player at the top is the one who outperformed the field the most; in essence, they were simulated as the winner. That constitutes one "simulated tournament." Of course, the Monte Carlo method works by repeating this many, many times. I arbitrarily chose to run the simulation 100,000 times to try to "converge" on a representative prediction.

But how did I choose to simulate their scores? This is again where I chose to make some assumptions. The biggest one is that a player's score relative to the field (as derived above) comes from a normal distribution. I took each player's input data of scores relative to the field, found their mean and standard deviation, and then drew the metric from a normal distribution with that mean and standard deviation. This is a common assumption when we don't know what distribution the data was actually sampled from. There are methods for testing whether data plausibly comes from a given distribution; perhaps next time I will run one before proceeding to the simulation. There are tons of refinements that can be applied to the Monte Carlo method in an effort to (hopefully) gain a better result. Given I only had a day, I decided to keep things simple.
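For example, a Shapiro-Wilk test is one common normality check. A minimal sketch using scipy, with made-up data standing in for a player's 40-round history:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
rel_scores = rng.normal(loc=-2.0, scale=3.0, size=40)  # stand-in for 40 rounds

stat, p_value = stats.shapiro(rel_scores)
# a small p-value (say < 0.05) is evidence against normality
print(f"W = {stat:.3f}, p = {p_value:.3f}")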

Main Logic (Code)

Here you can see the main function that conducts the Monte Carlo simulation.

import numpy as np
import pandas as pd

SIMS = 100_000  # number of simulated tournaments

def mc_sim():
    # `df` is a DataFrame indexed by player name holding each player's
    # 'mean' and 'std' relative-to-field score; `players` is its index
    _res_dfs = []
    for i in range(SIMS):
        scores = {}
        for player in players:
            p_df = df[df.index == player]
            mu = p_df['mean'].iloc[0]
            sigma = p_df['std'].iloc[0]
            # draw the player's four rounds from their normal distribution
            p_scores = np.random.normal(mu, sigma, size=4)
            scores[player] = p_scores

        # one simulated tournament: sum the four rounds and rank the field
        _res_df = pd.DataFrame.from_dict(scores, orient='index')
        _res_df['total'] = _res_df.sum(axis=1)
        _res_df.sort_values(by='total', inplace=True)
        _res_df['Rank'] = _res_df['total'].rank()
        _res_dfs.append(_res_df['Rank'])

    # one column of ranks per simulation; average them, then rank the averages
    res_df = pd.concat(_res_dfs, axis=1, join="inner")
    res_df['TOTAL_Rank'] = res_df.sum(axis=1) / SIMS   # average finishing rank
    res_df['TOTAL_Avg_Rank'] = res_df['TOTAL_Rank'].rank()
    res_df.sort_values(by='TOTAL_Avg_Rank', inplace=True)
    res_df.to_csv('resultsfull.csv')
    print(res_df)

Results

Once we have the results of the 100,000 simulations, the fun can really begin. The first thing I did was find the average finishing position of each player, and then rank players by it. The top 25 looks like this:

Player Name           Avg Place   Avg Place Rank
Jordan Spieth         24.50305    1
Viktor Hovland        30.43183    2
Justin Thomas         30.77675    3
Daniel Berger         31.21769    4
Corey Conners         32.40734    5
Charley Hoffman       32.78941    6
Abraham Ancer         33.62396    7
Xander Schauffele     33.85829    8
Collin Morikawa       33.87578    9
Tony Finau            34.46566    10
Tyrrell Hatton        36.24489    11
Jon Rahm              36.39768    12
Patrick Cantlay       36.68377    13
Chris Kirk            36.71034    14
Bryson DeChambeau     36.9725     15
Cameron Tringale      37.70055    16
Patrick Reed          38.12312    17
Emiliano Grillo       38.49645    18
Matthew Fitzpatrick   38.66213    19
Joaquin Niemann       38.74054    20
Webb Simpson          39.16836    21
Brian Harman          40.53283    22
Cameron Smith         40.83132    23
Will Zalatoris        41.10281    24
Max Homa              42.12405    25

For reasons that I discussed in the first section, I did not feel it was wise to buy into the predictive power of this interpretation. I thought the two more interesting questions were:

  • How often does each player come in first?

  • How often does each player place inside the top 5?

With these questions in mind, I generated the table below. The "Dec" columns are the implied decimal odds, i.e. 100 divided by the corresponding percentage:

Player Name           Win %   Win Dec   Top 5s   Top 5 %   Top 5 Dec
Jordan Spieth         5.95    16.81     22864    22.86     4.37
Viktor Hovland        4.73    21.14     18073    18.07     5.53
Tyrrell Hatton        4.59    21.79     16176    16.18     6.18
Justin Thomas         4.49    22.27     17362    17.36     5.76
Collin Morikawa       4.33    23.09     16281    16.28     6.14
Sam Burns             4.08    24.51     12596    12.6      7.94
Charley Hoffman       3.92    25.51     15705    15.7      6.37
Bryson DeChambeau     3.8     26.32     14314    14.31     6.99
Cameron Smith         3.73    26.81     13474    13.47     7.42
Patrick Cantlay       3.62    27.62     14205    14.21     7.04
Sungjae Im            3.11    32.15     10610    10.61     9.43
Emiliano Grillo       2.67    37.45     11788    11.79     8.48
Tony Finau            2.2     45.45     11505    11.5      8.7
Carlos Ortiz          2.1     47.62     7788     7.79      12.84
Matt Wallace          2.05    48.78     8605     8.6       11.63
Jon Rahm              1.94    51.55     10206    10.21     9.79
Rory McIlroy          1.78    56.18     8379     8.38      11.93
Max Homa              1.67    59.88     8505     8.51      11.75
Chris Kirk            1.66    60.24     9342     9.34      10.71
Keegan Bradley        1.47    68.03     6982     6.98      14.33
Scottie Scheffler     1.43    69.93     7509     7.51      13.32
Patrick Reed          1.36    73.53     8240     8.24      12.14
Daniel Berger         1.3     76.92     9322     9.32      10.73
Xander Schauffele     1.29    77.52     8758     8.76      11.42
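For reference, here is a minimal sketch of how those percentages can be pulled back out of the per-simulation ranks that mc_sim() saves to resultsfull.csv (assuming the file holds one rank column per simulation plus the two TOTAL columns):

import pandas as pd

res_df = pd.read_csv('resultsfull.csv', index_col=0)
ranks = res_df.drop(columns=['TOTAL_Rank', 'TOTAL_Avg_Rank'])  # one column per simulation

summary = pd.DataFrame({
    'Win %':   (ranks == 1).mean(axis=1) * 100,   # share of sims won
    'Top 5 %': (ranks <= 5).mean(axis=1) * 100,   # share of sims inside the top 5
})
summary['Win Dec'] = 100 / summary['Win %']        # implied decimal odds
summary['Top 5 Dec'] = 100 / summary['Top 5 %']
print(summary.sort_values('Win %', ascending=False).head(25))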
