Football Predictors

I'm finally feeling like putting all of my academic statistical knowledge to use. With the help of some other people in the contrarian community, I'm going to try a first stab at handicapping this summer (yes, summer). Before I do all the necessary coding for the stats stuff, I need to figure out what the best predictors are for success in college, pros, and/or both so the data can be acquired and manipulated.

My initial list looks like this: points scored, points allowed, yards gained, yards allowed, and interceptions thrown. The first four should be pretty obvious. The last one is the proxy for turnovers. Clearly, turnovers affect a football game's outcome, but other research has shown that fumbles are fairly random. However, interceptions depend mostly on the quality of the quarterback and are predictable from week-to-week (as any Penn State fan from 2006-7 can attest).



If you wanted to predict outcomes of games, what statistics would you use, and why would they be better than the ones I'm looking at trying? Also, in the comments, feel free to make fun of me for even bothering to attempt a project that will surely fail.

14 comments:

adam said...

I think what you're trying to do is retarded.

Obviously just kidding, and am happy to be a part of it...

The short answer is that I think it's better to keep an open mind about what statistics to use in analysis rather than go in with a preconceived notion of what is going to work. I think yards and turnovers (although you're right, interceptions more so) are going to be the best predictors of success, but what we think without fact behind our thought process isn't really important at all.

There may also be other tidbits of information that have the ability to be gleaned from your research that could be interesting without practical application (regressions on pass/run ratio for example). I'm excited for results, regardless. Whether or not it proves to be useful in handicapping, the results should be enjoyable ala KenPomeroy style.

adam said...

Also, after reading a little into random fumbling, the "common knowledge" appears to be that fumbling is a skill (or lack thereof) even if the recovery is essentially random. So a team that fumbles the ball more is bound to turn the ball over more on average.

am19psu said...

The short answer is that I think it's better to keep an open mind about what statistics to use in analysis rather than go in with a preconceived notion of what is going to work.

I absolutely agree with this. It's exactly the reason why I posted it here first before putting your unemployed, collegiate ass to work.

There may also be other tidbits of information that have the ability to be gleaned from your research that could be interesting without practical application (regressions on pass/run ratio for example).

I'm hoping at the very least (assuming we can get the data) we finally put the reverse line movement argument to rest.

Whether or not it proves to be useful in handicapping, the results should be enjoyable ala KenPomeroy style.

Just one more thing to keep TheFiancee and me from actually talking to each other when we're in the same room.

am19psu said...

that fumbling is a skill (or lack thereof) even if the recovery is essentially random

Can you post a link on that? It's actually been about a year since I read up on it. I probably should have done a google search before I just threw it up there like it was fact.

rolub said...

adam's advice is solid regarding keeping an open mind; football outsiders tinker with their formulas year to year.

http://footballoutsiders.com/info/FO-basics

Recovery of a fumble, despite being the product of hard work, is almost entirely random.

Stripping the ball is a skill. Holding onto the ball is a skill. Pouncing on the ball as it is bouncing all over the place is not a skill. There is no correlation whatsoever between the percentage of fumbles recovered by a team in one year and the percentage they recover in the next year. The odds of recovery are based solely on the type of play involved, not the teams or any of their players.

Fans like to insist that specific coaches can teach their teams to recover more fumbles by swarming to the ball. Chicago's Lovie Smith, in particular, is supposed to have this ability. However, since Smith took over the Bears, their rate of fumble recovery on defense went from a league-best 76 percent to a league-worst 33 percent in 2005, then back to 67 percent in 2006. Last year, they recovered 57 percent of fumbles, close to the league average.

Fumble recovery is equally erratic on offense. In 2006, the Detroit Lions fumbled 21 times on offense and recovered just four of those fumbles. Last year, the Lions fumbled 29 times on offense--but actually had fewer turnovers because they recovered 16 of those fumbles.

Fumble recovery is a major reason why the general public overestimates or underestimates certain teams. Fumbles are huge, turning-point plays that dramatically impact wins and losses in the past, while fumble recovery percentage says absolutely nothing about a team's chances of winning games in the future. With this in mind, Football Outsiders stats treat all fumbles as equal, penalizing them based on the likelihood of each type of fumble (run, pass, sack, etc.) being recovered by the defense.

Other plays that qualify as "non-predictive events" include blocked kicks and touchdowns during turnover returns. These plays are not "lucky," per se, but they have no value whatsoever for predicting future performance.

* Pro Football Prospectus 2005, New Orleans chapter

Anonymous said...

I would be careful with interceptions, too. I realize throwing INT's is more of a reflection of a bad quarterback than a fumble, but there are still plenty of tipped balls/hail mary's/db's drop INT chances that using interceptions could be tricky.

adam said...

I absolutely agree with this. It's exactly the reason why I posted it here first before putting your unemployed, collegiate ass to work.

I'm only unemployed until I get a response on the interview I had this past week, sucka. You got 'til Friday.

No, but really, the data you're going to get from the box scores is going to be as verbose as possible. Whatever we/you put together to find the most significant statistics with 10+ years of data (it's been so long that I don't remember the exact number) will certainly be whittled down a significant amount from what we/you start with... I don't know how much conjecture will help us here, although discussion is always healthy.

am19psu said...

I would be careful with interceptions, too. I realize throwing INT's is more of a reflection of a bad quarterback than a fumble, but there are still plenty of tipped balls/hail mary's/db's drop INT chances that using interceptions could be tricky.

Sure, but there is still a high (relatively speaking) week-to-week correlation between INTs.
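One way to check the week-to-week claim, as a rough sketch only: pair each team's interception count in one week with its count the following week, pool the pairs across teams, and compute a Pearson correlation. The function names and the toy numbers below are made up for illustration; real box-score data would replace them.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # no variation, so the correlation is undefined; report 0
    return cov / sqrt(var_x * var_y)


def week_to_week_correlation(weekly_by_team):
    """Pool (week t, week t+1) pairs across all teams and correlate them."""
    current, following = [], []
    for weekly in weekly_by_team.values():
        current.extend(weekly[:-1])    # weeks 1..n-1
        following.extend(weekly[1:])   # weeks 2..n
    return pearson(current, following)


# Made-up interception counts per week, for illustration only.
toy_ints = {
    "Team A": [2, 3, 2, 3],
    "Team B": [0, 1, 0, 1],
    "Team C": [1, 1, 2, 1],
}
r = week_to_week_correlation(toy_ints)
```

Running the same function on fumbles lost would give a side-by-side comparison of how persistent each kind of turnover is.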

Whatever we/you put together to find the most significant statistics with 10+ years of data (it's been so long that I don't remember the exact number) will certainly be whittled down a significant amount from what we/you start with

I'm not sure what else to even try to gauge teams' abilities for covering the spread. Yards, points, turnovers. Obviously, you want to explain as much variance as possible using as few variables as makes sense. I guess we'll find out.

By the way, if you care, I think I'm going to try logistic regression as the first prediction tool.

I'm going to keep letting comments trickle in here, and I'll hopefully send you a PM over at RMMB tonight.

moneyline said...

First off, good luck. In my opinion CFB is amongst the toughest sports to cap (read: cap, not fade the public).

If you're really going to do this then I think you need to create some sort of yards per play adjusted for opponent (offense/defense) metric. I think such a stat would go a long way towards telling you how good teams really are.
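A minimal sketch of what an opponent-adjusted yards-per-play number could look like, under one simple assumption: compare each offensive performance with what that defense usually allows, then add the league average back. The function name, the input layout, and the adjustment rule are all hypothetical choices here, not an established metric.

```python
from collections import defaultdict


def opponent_adjusted_ypp(games):
    """games: list of (offense, defense, yards, plays) tuples, one per team-game.

    Returns {team: adjusted offensive yards per play}.
    """
    allowed_yards = defaultdict(float)
    allowed_plays = defaultdict(int)
    total_yards = 0.0
    total_plays = 0
    for _offense, defense, yards, plays in games:
        allowed_yards[defense] += yards
        allowed_plays[defense] += plays
        total_yards += yards
        total_plays += plays

    league_ypp = total_yards / total_plays
    # What each defense gives up per play across all of its games.
    defense_ypp = {d: allowed_yards[d] / allowed_plays[d] for d in allowed_yards}

    # For each game, how far above or below the defense's norm the offense was.
    diffs = defaultdict(list)
    for offense, defense, yards, plays in games:
        diffs[offense].append(yards / plays - defense_ypp[defense])

    return {team: league_ypp + sum(d) / len(d) for team, d in diffs.items()}


# Made-up games, for illustration only.
toy_games = [
    ("A", "C", 300, 50),
    ("B", "C", 200, 50),
    ("A", "D", 400, 50),
    ("B", "D", 300, 50),
]
adjusted = opponent_adjusted_ypp(toy_games)
```

Swapping the roles of offense and defense in the input gives the matching defensive number. A stricter version would leave the game being adjusted out of the defense's average.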

Also, you're going to need to figure out how to meaningfully account for the impact that special teams have.

am19psu said...

If you're really going to do this then I think you need to create some sort of yards per play adjusted for opponent (offense/defense) metric.

I think it's going to eventually come down to that, but I'm slightly optimistic that including the spread as a predictor takes care of some of the adjustment. Maybe just a simple SOS calculation can adjust for this (like RPI SOS or something). Adam, can this be done?

I'm hoping that going all the way to a Pomeroy-style prediction scheme isn't necessary, partially because it will be a ton of work, partially because it doesn't appear to have a ton of value for hoops sides. I think you are right though, total plays will be something to toy around with.

Also, you're going to need to figure out how to meaningfully account for the impact that special teams have.

I, of course, am an idiot. I wasn't even considering what effect special teams could have. How would you even quantify that? Net punting average?

adam said...

I just used this ridiculously 90s style website to get every spread for every "spreadable" game from 1995-2008 plus home/away/neutral data for every game. We have the box scores for 1995-2007, and the ones for last year will not be difficult to collect. I'm not guaranteeing every game that I have a spread for will have a box score, but it should come close.

Re: SOS.

Anything is possible. We could do a simple least squares model on points scored (ala this pdf... maybe you were the one who showed that to me?) for a rough team strength, or we can use the point spreads in that same vein. The math/programming behind that is easy; it's just a matter of whether or not 10 games is going to be a significant enough sample size. For teams that play more than one game against I-AA opponents it will be even more difficult.
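For a sense of how little code the least squares idea needs, here is a sketch that rates teams from point margins rather than raw points scored (an assumption made to keep it short). It ignores home field and solves the least squares equations by repeated sweeps instead of a matrix library. All names are hypothetical.

```python
def least_squares_ratings(games, sweeps=500):
    """games: list of (team_a, team_b, margin), margin = a's points minus b's.

    Finds ratings r that minimize the sum of (r[a] - r[b] - margin)^2.
    Setting the derivative to zero gives, for each team,
    r[team] = average of (opponent rating + margin vs. that opponent),
    which is applied repeatedly (Gauss-Seidel sweeps) until it settles.
    """
    schedule = {}
    for a, b, margin in games:
        schedule.setdefault(a, []).append((b, margin))
        schedule.setdefault(b, []).append((a, -margin))

    ratings = {team: 0.0 for team in schedule}
    for _ in range(sweeps):
        for team, played in schedule.items():
            ratings[team] = sum(ratings[opp] + m for opp, m in played) / len(played)

    # Ratings are only defined up to a constant, so center them on zero.
    mean = sum(ratings.values()) / len(ratings)
    return {team: r - mean for team, r in ratings.items()}


# Made-up results, for illustration only.
toy_results = [("A", "B", 7), ("B", "C", 7), ("A", "C", 14)]
toy_ratings = least_squares_ratings(toy_results)
```

The difference between two ratings is the model's predicted margin on a neutral field. The same function would take closing point spreads in place of actual margins. Teams that only play opponents outside the rest of the schedule (the I-AA problem) are rated relative to their own cluster only.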

am19psu said...

First, this.

Second, the amount of data you've acquired is going to make life a lot easier than I had initially thought.

moneyline said...

"partially because it will be a ton of work,"

Ok then my next question is: What's the goal here? To come up with a ratings system that aids in making money, or to cap games Covers style? Not trying to be a dick, but that is a relevant question.

"partially because it doesn't appear to have a ton of value for hoops sides."

The fact that Pomeroy's numbers match up so well with what Vegas puts out there is what makes them valuable, imo. To be fair, I don't think we've done the best job of leveraging them in the past, but that is a discussion for another day.

Bottom line, if you're capping games and your numbers differ wildly from the actual lines on a consistent basis, you're doing it wrong.

"How would you even quantify that? Net punting average?"

No clue, but you're going to have to account for a bit more than just punting.

am19psu said...

"What's the goal here? To come up with a ratings system that aids in making money, or to cap games Covers style? Not trying to be a dick, but that is a relevant question."

Neither, actually. In stats, you can quantify a decision boundary (in this case, wager on the favorite or the dog) through a number of different means. For example, logistic regression, linear/quadratic discriminant analysis, support vector machines, etc. The method I'm hoping to employ will output the estimated probability that the favorite will cover. It's similar to, but a level above the approach used in the academic papers I linked that showed that betting markets are efficient.

Re: the amount of work comment, I should probably elaborate. I'm not against putting time and effort into this, but I just don't have enough time to teach myself database management, which would be required to put a KPom-type system in use. Adam has graciously offered to help me with data collection for this project, but it is probably 10% of the effort required to get a true KPom-prediction system.

The fact that Pomeroy's numbers match up so well with what Vegas puts out there is what makes them valuable, imo. To be fair, I don't think we've done the best job of leveraging them in the past, but that is a discussion for another day.

I agree. I haven't figured out a way to leverage Pomeroy's numbers, but they are obviously well correlated to the betting spread. If anyone actually has game-by-game data, I'd be happy to analyze it, but I don't think such data exists currently.

Back to football, my goal isn't to determine what the final score is going to be, only which side is going to cover. I mean, it doesn't make any difference to the bottom line whether a team covers by 1 or 30. That's why I'm trying a binary approach to the stats.
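A bare-bones sketch of that binary approach, assuming plain logistic regression fit by gradient descent: the label is 1 if the favorite covered and 0 otherwise, and the output is an estimated probability that the favorite covers. The two features (yards-per-play differential and interception differential, favorite minus underdog) and every number below are made up for illustration. A real run would use a stats package and the actual box-score and spread data.

```python
from math import exp


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)


def cover_probability(weights, bias, features):
    """Estimated probability that the favorite covers."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def fit_logistic(X, y, learning_rate=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n = len(X)
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, label in zip(X, y):
            error = cover_probability(weights, bias, features) - label
            grad_b += error
            for j, x in enumerate(features):
                grad_w[j] += error * x
        bias -= learning_rate * grad_b / n
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


# Made-up rows: [yards-per-play differential, interception differential].
toy_X = [[1.0, -0.5], [0.8, -0.2], [0.5, 0.1], [-0.2, -0.4],
         [0.6, -0.1], [-0.5, 0.2], [-0.9, 0.6], [-1.1, 0.4]]
toy_y = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = favorite covered

toy_weights, toy_bias = fit_logistic(toy_X, toy_y)
p_cover = cover_probability(toy_weights, toy_bias, [1.0, -0.5])
```

A wager only makes sense when the estimated probability clears the break-even rate implied by the price, roughly 52.4% at standard -110 odds.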