Finding Balance on the Bias-Variance Seesaw
Building models is a tricky business. There are lots of decisions involved and competing motivations. Say we are ecologists studying owl abundance in a park near our school. Our primary goal may be to understand what is going on in our data. We don’t want to miss any important relationships between abundance and measurable features of the landscape. For example, if we didn’t include tree cover as an explanatory variable, we might end up with an underfit model, since tree cover tells us about the availability of spots for owls to nest.
However, as scientists trying to add to a broader body of literature, we also want our understanding of owl abundance to be useful to those with other data from a similar context. It would be nice if whatever owl abundance model we come up with would work well for us but could also be useful to ecologists studying owls in nearby parks or even in the same park next year.
But of course nature is, well, complicated. When we start building our model we may be tempted to account for every piece of information about that park, so that we can model the abundance of our owls perfectly. And these days it’s easier than ever to throw a bunch of explanatory variables into a model to help account for all of this complexity, because so much more information is available to us. We have data about every aspect of the weather, sometimes even on an hourly basis. We have data about the landscape, from soil moisture to tree cover to elevation. However, we can get to the point where all of that accounting for potential variability leads us astray when we want our model to generalize to other scenarios. We may overfit to our particular scenario such that our model is no longer relevant beyond our particular case.
How can we balance these competing goals? It’s a tradeoff, a bias-variance tradeoff to be exact. Consider the mean squared error (MSE), a quantity we often want to minimize to ensure our model is working well. The MSE is the average of the squared differences between the observed values and the predicted values. If this value is small, it means our predictions line up with what we see. When we do the math, we find that the mean squared error can be decomposed as the sum of a variance term and a squared bias term (plus, when we are predicting noisy observations, an irreducible noise term that no model can remove). Bias is systematic error, and it occurs when we miss an important relationship in our model (underfitting). Variance occurs when a model is sensitive to small deviations in the data. Say we jitter our data a little bit and get wildly different predictions. That is not ideal. Remember that any data we have comes with some random noise. If we overfit the model, we might end up accounting for that noise too, so looking at a new dataset with its own noise can cause us grief. We generally cannot minimize both the bias and the variance at once, but understanding the relationship between the two can help us make better modeling decisions.
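To make the decomposition concrete, here is a minimal simulation sketch in Python. Everything in it is an assumption made for illustration: the true mean, the noise level, the sample size, and the idea of estimating a mean by multiplying the sample average by a "shrink" factor. It is not tied to any real owl data. For each estimator it computes the MSE directly and then separately computes the squared bias and the variance, so we can check that the pieces add up.

```python
import random
import statistics

random.seed(1)

TRUE_MEAN = 1.0   # hypothetical true mean abundance (assumption)
NOISE_SD = 3.0    # hypothetical noise standard deviation (assumption)
N_OBS = 5         # observations in each simulated data set
N_SIMS = 20000    # number of simulated data sets


def simulate_estimates(shrink):
    """Return one estimate per simulated data set: shrink * sample mean."""
    estimates = []
    for _ in range(N_SIMS):
        sample = [random.gauss(TRUE_MEAN, NOISE_SD) for _ in range(N_OBS)]
        estimates.append(shrink * statistics.fmean(sample))
    return estimates


def decompose(estimates):
    """Return (mse, squared_bias, variance) of the estimates around TRUE_MEAN."""
    mse = statistics.fmean([(e - TRUE_MEAN) ** 2 for e in estimates])
    bias = statistics.fmean(estimates) - TRUE_MEAN
    variance = statistics.pvariance(estimates)
    return mse, bias ** 2, variance


unshrunk = decompose(simulate_estimates(1.0))  # plain sample mean (unbiased)
shrunk = decompose(simulate_estimates(0.5))    # pulled toward zero (biased)

for label, (mse, bias_sq, variance) in [("unshrunk", unshrunk), ("shrunk", shrunk)]:
    print(f"{label}: mse={mse:.3f}  bias^2={bias_sq:.3f}  variance={variance:.3f}")
```

Under these made-up settings, theory says the plain sample mean has variance 9/5 = 1.8 and no bias, while the half-shrunk version has variance 0.45 and squared bias 0.25. The simulated numbers should land close to those values, and in both cases the MSE equals the squared bias plus the variance.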
For example, it is possible to increase the bias but decrease the variance enough that the mean squared error decreases overall. One strategy to do this, and thereby avoid overfitting, is regularization. This is an approach that penalizes covariate coefficients that are too large, favoring simple models over fancier ones. The motivation for this penalization is that an overfit model places too much importance on particular relationships in the model at the expense of covariates that are left out of the model. (Read more about the intuition here). These approaches effectively “shrink” coefficients in a model, and depending on the approach can even “snap” them to zero.
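As a rough sketch of what "shrink" and "snap" mean, consider the simplest possible case: one centered covariate, no intercept, and a single slope. In that special case both a squared penalty (often called ridge) and an absolute-value penalty (often called lasso) have closed-form solutions. The function names, the penalty values, and the tiny data set below are all hypothetical and chosen only for illustration; real analyses would typically use an established library and several covariates.

```python
def ridge_slope(x, y, penalty):
    """Slope b minimizing sum((y - b*x)^2) + penalty * b^2."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + penalty)


def lasso_slope(x, y, penalty):
    """Slope b minimizing sum((y - b*x)^2) + penalty * |b|."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    # Soft-thresholding: reduce the magnitude by penalty / 2, but never below zero.
    shrunk_magnitude = max(abs(sxy) - penalty / 2.0, 0.0)
    sign = 1.0 if sxy >= 0 else -1.0
    return sign * shrunk_magnitude / sxx


# Hypothetical centered tree cover (x) and centered owl counts (y).
tree_cover = [-2.0, -1.0, 0.0, 1.0, 2.0]
owl_count = [-1.1, -0.4, 0.2, 0.3, 1.0]

for penalty in [0.0, 5.0, 20.0]:
    ridge = ridge_slope(tree_cover, owl_count, penalty)
    lasso = lasso_slope(tree_cover, owl_count, penalty)
    print(f"penalty={penalty:>4}: ridge slope={ridge:.3f}  lasso slope={lasso:.3f}")
```

With a penalty of zero both functions return the ordinary least squares slope. As the penalty grows, the ridge slope keeps getting smaller without ever reaching zero, while the lasso slope hits exactly zero once the penalty is large enough.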
It’s important to realize that regularized models are no longer unbiased though, so doing inference on the particular coefficients is less straightforward. If we really care about the relationship between precipitation in August and owl abundance, this may not be the strategy we want to take. However, if we are trying to have more precise predictions across multiple owl scenarios, we might sacrifice a bit of knowledge about temperature effects for a better idea of owl abundances across different habitats. This might be true if we are predicting the abundance of owls so that we know if the species passes a particular threshold for being endangered or even invasive.
We can even assess if we have overfit our model by taking a tip from the more machine learning-focused literature and splitting our data into training and testing sets. We first build the model on some of our data and then try making predictions for the remaining data. If our mean squared error is noticeably worse in our test set than in our training set, that is a sign that we may have accounted for too much variation in our particular data set. We may want to simplify our approach to ensure it is more generalizable. What if we can’t afford to sacrifice some of our data for a test set? All of those owl counts are hard won. Then perhaps we want to stick to the basics, borrow Occam’s Razor, and keep our model simple in the first place.
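Here is a minimal sketch of that train/test check in plain Python. The data are simulated under an assumed straight-line relationship with noise, and the two models are stand-ins chosen for illustration: a simple fitted line, and a deliberately over-flexible "nearest neighbor" rule that just copies the response of the closest training point. None of these names or numbers come from a real owl study.

```python
import random

random.seed(7)


def mse(observed, predicted):
    """Average squared difference between observed and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda new_x: intercept + slope * new_x


def fit_nearest_neighbor(x, y):
    """Over-flexible model: predict the response of the closest training point."""
    pairs = list(zip(x, y))
    return lambda new_x: min(pairs, key=lambda p: abs(p[0] - new_x))[1]


# Hypothetical data: a covariate on a 0-10 scale and a noisy linear response.
x = [random.uniform(0, 10) for _ in range(1000)]
y = [2.0 + 0.5 * xi + random.gauss(0, 1.0) for xi in x]

# Shuffle the row indices, then hold out 30% of the rows as a test set.
indices = list(range(len(x)))
random.shuffle(indices)
train_idx, test_idx = indices[:700], indices[700:]
train_x = [x[i] for i in train_idx]
train_y = [y[i] for i in train_idx]
test_x = [x[i] for i in test_idx]
test_y = [y[i] for i in test_idx]

line_model = fit_line(train_x, train_y)
nn_model = fit_nearest_neighbor(train_x, train_y)

line_train = mse(train_y, [line_model(v) for v in train_x])
line_test = mse(test_y, [line_model(v) for v in test_x])
nn_train = mse(train_y, [nn_model(v) for v in train_x])
nn_test = mse(test_y, [nn_model(v) for v in test_x])

print(f"line:             train MSE={line_train:.3f}  test MSE={line_test:.3f}")
print(f"nearest neighbor: train MSE={nn_train:.3f}  test MSE={nn_test:.3f}")
```

The nearest neighbor rule reproduces its own training data perfectly, so its training MSE is zero, yet it does noticeably worse on the held-out rows. The line has similar errors on both sets. A large gap between training and test error like the first one is the warning sign described above.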
To be clear, there is no magic formula for the perfect model selection technique. The key is to think about our goals and evaluate each modeling choice in turn. We should consider if adding a particular piece of information to the model is helping or hurting us both in the short term (in our data set) and in the long term (in others’ related data sets). As we get more experience balancing the bias-variance tradeoff, we’ll build more intuition about when complexity is necessary to describe our ecological phenomenon and when it is superfluous.
Have a quantitative term or concept that mystifies you? Want it explained simply? Suggest a topic for next month → @sastoudt. You can read more about Sara and her stats work on her Ecology for the Masses profile here.