I've got time to put the full record update up since games don't start again until Thursday, so I figured I would add to this post from last week.

Since I figured out the expected wins of each team from the totals posted at BetUS, all I needed was a standard deviation to come up with a distribution of the probability of each team having a particular number of wins. I did this by using last year's totals and actual results to come up with a root mean squared error estimate of the standard deviation, which was around 2.16.
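As a minimal sketch of that RMSE calculation, assuming you have pairs of (posted win total, actual wins) from last season (the pairs below are made up for illustration, not the real 2008 data):

```python
import math

# Hypothetical (posted win total, actual wins) pairs from last season
last_year = [(9.5, 11), (7.5, 6), (10.5, 10), (5.5, 8), (8.5, 7)]

# Root mean squared error of actual wins around the posted total,
# used as an estimate of the standard deviation of the win distribution
rmse = math.sqrt(
    sum((actual - total) ** 2 for total, actual in last_year) / len(last_year)
)
```

Running this over the full set of last year's totals is what produces the ~2.16 figure.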

I'll let Jonny analyze each particular team, but I came up with an unexpected and almost assuredly erroneous result. The numbers show the expected number of teams that should go undefeated is 4.27.
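The 4.27 figure would come from treating each team's wins as Normal(posted total, 2.16), taking the upper tail past 11.5 as the chance of going 12-0, and summing across teams. A sketch, with hypothetical totals standing in for the actual BetUS lines:

```python
from statistics import NormalDist

SD = 2.16  # RMSE-based estimate of the standard deviation

# Hypothetical posted win totals, not the actual BetUS numbers
posted_totals = [10.5, 10.0, 9.5, 11.0, 8.5]

def p_undefeated(total, games=12):
    # P(wins > games - 0.5) under the normal approximation
    return 1 - NormalDist(total, SD).cdf(games - 0.5)

# Expected number of undefeated teams = sum of each team's P(12-0)
expected_undefeated = sum(p_undefeated(t) for t in posted_totals)
```

Summed over all 120 teams with the real totals, this is the kind of calculation that yields 4.27.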

There are a couple of reasons that immediately spring to mind why this number is wrong. First off, my methodology could be wrong. I used a normal distribution to estimate probabilities for discrete data. While not terrible, it's still not good (for math geeks, this is analogous to using the trapezoidal rule for estimating integrals).

Also, it may not be a good idea to use a symmetric normal distribution for totals of 10, 11, or 12, since the tails of a normal distribution go to infinity and you are butting up against the cap on the total possible number of wins.

Another reason would be that last year's RMSE is not representative of the true standard deviation of the distribution. The last thing that comes to mind is the books shading the lines high for the expected good teams, expecting to take a lot of over action.

To use a further example, last year Moneyline, when he was still running a blog, estimated USC's chances of going undefeated at 14.8% and their chances of one loss at 32.4%. Using this methodology, 2008 USC had a 31.3% chance of going undefeated and a 17.7% chance of one loss. Obviously, Moneyline's numbers are way closer to reality than these are.

There is quite a bit of room for improvement here, I'm just not sure where.


## 4 comments:

I hope you at least used a clipped normal rather than actually allowing the win total to exceed the number of games played.

As a still simple but more realistic method, why not determine the win % needed to match the Expected Wins, and then use that to predict the distribution of wins by assuming a sequence of Bernoulli trials?

This is obviously flawed as well, since the probability of exactly one team winning any particular game is unlikely to equal one, but it's something.
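The commenter's suggestion can be sketched as follows: back out a flat per-game win probability p from the expected wins (p = E(wins)/12), then take the season win distribution as Binomial(12, p). The team total here is hypothetical:

```python
from math import comb

def win_distribution(expected_wins, games=12):
    # Flat per-game win probability matching the expected wins
    p = expected_wins / games
    # Binomial(games, p) probability of each win count 0..games
    return [comb(games, k) * p**k * (1 - p)**(games - k) for k in range(games + 1)]

dist = win_distribution(10.5)  # hypothetical posted total
p_undefeated = dist[12]        # P(12-0) = (10.5/12)**12
```

Unlike the normal approximation, this distribution is discrete and sums to exactly one over 0 through 12 wins by construction.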

> I hope you at least used a clipped normal rather than actually allowing the win total to exceed the number of games played.

Not clipped, I just assumed that P > 11.5 was the probability of winning all 12 games. I started thinking about ways of assigning a weighting to the "leftover" probability so it doesn't all go in the last bin, but I'm tired and I'll think about it more tomorrow.
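The "all leftover mass in the last bin" approach described above amounts to discretizing the normal into win bins 0 through 12, with both tails folded into the end bins. A sketch, assuming a hypothetical posted total:

```python
from statistics import NormalDist

def win_probs(total, sd=2.16, games=12):
    # Bin Normal(total, sd) into win counts 0..games; the open tails
    # below 0.5 and above games - 0.5 fold into the end bins
    d = NormalDist(total, sd)
    probs = []
    for k in range(games + 1):
        lo = float("-inf") if k == 0 else k - 0.5
        hi = float("inf") if k == games else k + 0.5
        probs.append(d.cdf(hi) - d.cdf(lo))
    return probs

probs = win_probs(10.5)  # hypothetical total; sums to 1 by construction
```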

> As a still simple but more realistic method, why not determine the win % needed to match the Expected Wins, and then use that to predict the distribution of wins by assuming a sequence of Bernoulli trials?

If you want to use the binomial distribution, Moneyline had the right idea last year by assigning spreads to each game. I was hoping I could get a robust solution from this data without going through all of the work that he did last year. It may be that it's not possible, particularly if the distribution changes depending on the line the books set.

> Moneyline had the right idea last year by assigning spreads to each game. I was hoping I could get a robust solution from this data without going through all of the work that he did last year.

I have a mild idea about how to handle this. Is there anywhere that I can get (a) your win total data and (b) a complete 2009 Div I-A schedule, preferably in .csv format?

> As a still simple but more realistic method, why not determine the win % needed to match the Expected Wins, and then use that to predict the distribution of wins by assuming a sequence of Bernoulli trials?

The more I think about this, the more I think it is worthwhile to explore further. It's going to be a bit more complex than that, because I need an accurate estimate of E(wins), so there will have to be some reverse engineering, but there may be some merit here.

> I have a mild idea about how to handle this. Is there anywhere that I can get (a) your win total data and (b) a complete 2009 Div I-A schedule, preferably in .csv format?

Send me your email address today in a PM over at RMMB, and I'll get it to you after work.
