I’ve been playing with Kaggle, a new website that facilitates data analysis competitions. You compete with others to model some data in order to make predictions. I had some fun with the Don’t Get Kicked! competition, which has $10,000 in prizes for the best model that can predict when a used car bought at auction is a dud. After a couple of hours playing with logistic regression models, I’m in 64th place, with a goodness-of-fit score about 11% behind the overall leader.
I think Kaggle is a fun idea, but I have some concerns about using it for serious analysis. The competitions are evaluated using a single goodness-of-fit score, like RMSE. This encourages large, complex models that fit the data a little better but may be difficult to actually use for forecasting. With regression modelling, adding another variable never makes your in-sample goodness of fit worse, and almost always improves it a little bit. In the Don’t Get Kicked! competition, the dataset has around 70,000 observations, so there’s plenty of scope to include large numbers of statistically insignificant variables to improve your goodness of fit. I suspect the competition will be won by whoever has a computer powerful enough to estimate a model with all the possible variables and their interactions included.
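To make the point about extra variables concrete, here's a minimal sketch using simulated data (not the competition dataset). It fits ordinary least squares rather than logistic regression to keep things short, and the function names, sample sizes and the number of pure-noise regressors are all my own assumptions for illustration. It reports R-squared on the data used for fitting and on a held-out half as junk variables are added one at a time.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares coefficients for y ~ X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def r_squared(X, y, beta):
    """R-squared of the fitted coefficients on the given data."""
    resid = y - X @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


def nested_fit_scores(n_train=200, n_test=200, n_noise=20, seed=0):
    """Fit nested models with 0..n_noise irrelevant regressors.

    Returns a list of (k, in_sample_r2, held_out_r2) tuples, where k is
    the number of pure-noise variables included in the model.
    """
    rng = np.random.default_rng(seed)
    n = n_train + n_test
    x = rng.normal(size=n)                   # the one variable that matters
    y = 1.0 + 2.0 * x + rng.normal(size=n)   # assumed true relationship
    noise = rng.normal(size=(n, n_noise))    # variables unrelated to y

    scores = []
    for k in range(n_noise + 1):
        # Intercept, the real variable, and the first k noise variables.
        X = np.column_stack([np.ones(n), x, noise[:, :k]])
        beta = fit_ols(X[:n_train], y[:n_train])
        scores.append((
            k,
            r_squared(X[:n_train], y[:n_train], beta),
            r_squared(X[n_train:], y[n_train:], beta),
        ))
    return scores


if __name__ == "__main__":
    for k, r2_in, r2_out in nested_fit_scores():
        print(f"noise vars={k:2d}  in-sample R2={r2_in:.4f}  held-out R2={r2_out:.4f}")
```

The in-sample column can only go up as junk variables are added, because each model nests the previous one. The held-out column has no such guarantee, which is why the distinction matters when a leaderboard is scored on data the model hasn't seen.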
Of course, this is probably the only way that modelling competitions can be judged automatically. But in real-world modelling, other considerations such as statistical significance and diagnostic tests are as important as, or more important than, pure goodness of fit. The Kaggle competitions encourage brute force but won’t necessarily result in models that are very useful for those who are sponsoring the prizes. Better to hire an experienced consultant to build you a beautiful bespoke model :)