By Eddy Yeo
Change of Plans
Unfortunately, it seems that all the available comprehensive horse racing datasets cost money. So instead of discussing concrete results (as the author is too cheap to purchase the data), we will focus on the more general concepts behind the methodology in this second and final article in the series. The profit will have to be found in other sports. This highlights one of the practical issues of data science: collecting data. Such problems are not very fun to solve, but have to be dealt with nonetheless (or not, as in this case).
Art vs Science
While we use the term data science to describe data analysis, it is a bit of a misnomer. A lot of the work is seemingly subjective, qualitative analysis based on the domain knowledge and feel of the data scientist, which makes the skill something of an art.
The reason for this is that there is no learning without bias. Note that those in the machine learning community have a more precise definition of bias. I use it more loosely, in a way that roughly corresponds to “assumption”. When we make predictions with any model, we are making some assumptions about the unseen data points with respect to those that have been seen. Without them, the unseen data points can take on arbitrary values, so there is no way to predict them. The quality of the predictions depends greatly on the validity of the assumptions.
Coming back to horse racing, the choice of using a neural network already imposes a bias on the problem, although a relatively weak one, as neural networks are expressive if the structure is complex enough. The degree of the bias depends on the structure of the network. Here is an intuitive geometric explanation of how structure imposes a bias (if you can get past the math).
Tuning and Overfitting
The next question is, since we have to impose some sort of bias in our prediction method, what kind of bias should it be? In the past, before we had the speed of computers, we could only hand-tune statistical models. This only permitted models which have closed-form solutions, like linear regression in low dimensions. These models impose very strong biases, which increases the error rate of our predictions when the assumptions do not hold. It also prevented us from modelling complex phenomena, for which no simple model fits.
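To make "closed-form solution" concrete, here is a minimal sketch of one-variable linear regression, where the best slope and intercept come straight out of a formula with no iterative tuning. The function name `fit_line_closed_form` and the four data points are made up for illustration; they are not from any racing dataset.

```python
def fit_line_closed_form(xs, ys):
    """Least-squares line y = slope * x + intercept, computed directly.

    The strong bias here is the assumption that the relationship
    between x and y is a straight line.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Standard least-squares formulas for a single input variable.
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line_closed_form(xs, ys)
print(slope, intercept)
```

With only two numbers to solve for, this is the kind of model that could realistically be fitted by hand.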
But a method like a neural network, for which no closed-form solution exists, is practically impossible to hand-tune. So what we do is impose a weaker, higher-level bias, and let the computer tune the parameters of the model so that the error of our predictions against our training dataset is minimized. This allows us to model more complex phenomena. In a way, we let our data determine our bias.
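As a rough sketch of what "letting the computer tune the parameters" can look like, the code below uses plain gradient descent to minimize the mean squared error on a training set. To keep it short, the model is a single weight and offset rather than a real neural network, and the data, learning rate, and step count are all illustrative assumptions.

```python
def mean_squared_error(w, b, xs, ys):
    """Average squared gap between predictions w * x + b and targets."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def tune_by_gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Repeatedly nudge w and b in the direction that lowers the error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical training data generated from y = 2x + 1.
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, 3.0, 5.0, 7.0]

error_before = mean_squared_error(0.0, 0.0, train_x, train_y)
w, b = tune_by_gradient_descent(train_x, train_y)
error_after = mean_squared_error(w, b, train_x, train_y)
print(error_before, error_after)
```

We only chose the shape of the model and the tuning procedure; the values of `w` and `b` came from the data.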
Since computers can tune the model to fit the data, why can’t they decide on the model as well? Firstly, as mentioned earlier, there is necessarily some kind of qualitative bias in any prediction method. However, even if we minimize that bias, there is actually a much more practical problem. This is the problem of overfitting.
Any dataset is a limited perspective of the data generating process we are trying to model. If our initial bias is too weak, and we give too much freedom to the computer to tune our model to fit the data, we end up getting a model which fits the dataset very well, but may not fit the overall process very well.
In the parable of the blind men and the elephant, each man is trying to model the elephant, and each one will construct a different view based on his perspective (dataset). If their preconceptions of the elephant are too weak (weak bias), each one will have an extremely inaccurate view of what an elephant is. But if, for instance, they know the general shape of an elephant (stronger bias), they can accurately predict the true shape based on the size of the part they are touching and maybe the texture.
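The trade-off between weak and strong bias can be sketched with a toy comparison. Below, a "memorizer" (very weak bias: it just returns the answer of the nearest training point) and a straight line (strong bias) are both fitted to the same noisy training data, then compared against the underlying process at points neither has seen. The process `y = 2x + 1`, the fixed alternating noise, and all function names are assumptions chosen so the example is small and repeatable.

```python
def fit_line(xs, ys):
    """Strong bias: assume the process is a straight line."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    return lambda x: slope * x + intercept


def fit_memorizer(xs, ys):
    """Very weak bias: answer with the y of the closest training x."""
    points = list(zip(xs, ys))
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def true_process(x):
    """Hypothetical data generating process."""
    return 2 * x + 1


# Training data: the true process plus fixed noise of +1, -1, +1, ...
train_x = [float(i) for i in range(10)]
noise = [1.0 if i % 2 == 0 else -1.0 for i in range(10)]
train_y = [true_process(x) + e for x, e in zip(train_x, noise)]

# Unseen points, scored against the noise-free process.
new_x = [i + 0.4 for i in range(9)]
new_y = [true_process(x) for x in new_x]

line = fit_line(train_x, train_y)
memorizer = fit_memorizer(train_x, train_y)

print("training error:", mse(memorizer, train_x, train_y), mse(line, train_x, train_y))
print("unseen error:  ", mse(memorizer, new_x, new_y), mse(line, new_x, new_y))
```

The memorizer reproduces its training set perfectly, noise included, yet the line tracks the underlying process more closely on the unseen points. That gap between the two errors is overfitting in miniature.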
One very important part of modelling which I’m not going to discuss much is choosing which features of the data to use. This greatly depends on the available dataset, its size, its quality, and so on. For example, for horse racing, we should take into account the weather and the age of the horse, but probably not the day of the week of the race (but even then, who knows? There may be some weird correlation). This is a very complex topic in its own right.
How does it all relate to sports?
All these concepts and techniques come into play when doing sports analytics. They have to be merged with strong domain knowledge in sports, so that appropriate assumptions are made in the modelling process. Perhaps it was good that there were no datasets available for horse racing, because I don’t really know much about it!