The How, Why, and When of Transforming Data

We’ve been out in the field, painstakingly collecting each butterfly and measuring its body length and wingspan. Now is the moment of truth. We’re about to make a plot and see if the assumptions we make about the relationship between the two measurements are backed up by a linear regression. Is the relationship between length and wingspan what we’d expect? Will a linear model be appropriate or are we going to have to break out the heavier machinery?

Our data, alas. 

The very first rule of fitting lines to points and trusting the inference about the significance of the resulting relationship is that the points should in fact look like a straight line. There’s a clear curve in the above plot. Are we sunk?

Not so fast! What if we adjusted the data a bit to make it look more linear? The “bulging rule”, often attributed to Mosteller and Tukey, offers some guidance on which of the many available transformations to try first.

This advice is often paired with a visual guide to help us remember what to do. A circle can be broken into four parts, and each quadrant corresponds to a particular set of recommendations for transforming the explanatory and/or response variable. Now we just need to decide which of the four quadrants of the circle our data most resembles and try out the recommended transformations to see what helps us reach linearity. 

Example of Tukey’s “Circle of Transformations”

Based on this transformation guide, we see that our data falls in Quadrant I of the “circle of transformations”. In this quadrant we are advised to try squaring one (or both) of the variables. Let’s see how that plays out. 

One success, one failure.

It looks like modeling y as a function of x^2 would be fairly linear. y^2 as a function of x, not so much. But is transforming our data like this cheating? Not as long as we remember to update our interpretation of the model in light of what we did. And the same basic principles of transformations work for generalized linear models that help us work with data that is linear, just on a different scale (the link scale to be precise).
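To make the comparison concrete, here is a minimal sketch using only the Python standard library. The data are simulated, not our butterfly measurements: the curve `20 - 0.5 * x**2`, the noise level, and the helper `fit_line` are all assumptions chosen to mimic a Quadrant I shape. It fits a line to y against x and to y against x^2, then compares how much of the variation each line explains.

```python
import math
import random


def fit_line(xs, ys):
    """Least squares with one predictor: returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


random.seed(42)  # make the simulated noise reproducible

# Hypothetical Quadrant I data: y falls off faster and faster as x grows.
x = [0.1 * i for i in range(50)]
y = [20 - 0.5 * xi ** 2 + random.gauss(0, 0.2) for xi in x]

# Fit on the raw scale, then on the squared-x scale.
_, _, r2_raw = fit_line(x, y)
slope_sq, intercept_sq, r2_squared_x = fit_line([xi ** 2 for xi in x], y)

print(f"R^2 for y ~ x:   {r2_raw:.3f}")
print(f"R^2 for y ~ x^2: {r2_squared_x:.3f}")
print(f"slope on the x^2 scale: {slope_sq:.3f}")
```

Note what the slope now means: it is the expected change in y for a one-unit change in x^2, not in x. That is the kind of interpretation update the transformation requires.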

What if our data were to have fallen in a different quadrant? Quadrant II recommends squaring or cubing the response variable and/or taking a square root or log transform of your explanatory variable. The latter might be a problem if our covariate has negative values. A quick hack is to shift the x values enough so that everything is positive before applying the transformation. 

It’s important to note that the more we transform our data, the more we need to update our interpretation when we write about it. The shift approach above can muddy interpretation a bit (and we have to remember to shift everything back at some point), so I’m not advocating for it all of the time, but it’ll get the job done for the purposes of this example.
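The shift hack can be sketched in a few lines. The function names `shifted_log` and `undo_shifted_log` are hypothetical, and so is the choice to move the minimum to 1 before taking the log; a different target would give different transformed values, which is part of why the shift muddies interpretation.

```python
import math


def shifted_log(xs, target_min=1.0):
    """Shift xs so the smallest value equals target_min, then take the natural log.

    Returns the transformed values and the shift that was applied,
    so the shift can be undone later.
    """
    smallest = min(xs)
    # Only shift when a log would otherwise be undefined.
    shift = target_min - smallest if smallest <= 0 else 0.0
    return [math.log(x + shift) for x in xs], shift


def undo_shifted_log(zs, shift):
    """Map transformed values back to the original x scale."""
    return [math.exp(z) - shift for z in zs]


# Hypothetical covariate with negative values.
x = [-3.0, -1.5, 0.0, 2.0, 4.5]

z, shift = shifted_log(x)
x_back = undo_shifted_log(z, shift)

print("shift applied:", shift)
print("log of shifted x:", [round(v, 3) for v in z])
print("recovered x:", [round(v, 3) for v in x_back])
```

Returning the shift alongside the transformed values is a small design choice that makes it harder to forget to shift everything back.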

In this case, it looks like the square root or log transform helps us out most with linearity.

Quadrant III recommends taking a square root or log transform of your explanatory variable and/or your response variable. Again, I’ll use a quick shift, making our negative values positive so that we can see what’s going on. It can be challenging to adjust your interpretation based on the log transformation in particular, so you may want to check out our post on interpreting the coefficients in your model for some additional guidance. 

So many transformations, so little time.

And last but not least, Quadrant IV recommends taking a square root of your response variable and/or squaring your explanatory variable.

Both look pretty linear to me. One might pick the one with a simpler interpretation, or avoid the shift in the square root hack by squaring x. 

Now we’re up to speed on transformations to help make our data fall into line. Some of these transformations may also help us better meet other conditions, including the constant variance requirement. But a lingering question remains. At what stage of the analysis do we perform the transformation?

As always, the statistical wisdom is: it depends. A good rule of thumb is to do the transformation at the level that you want to do inference on. So if we plan on building a model based on our raw data, we should transform the data right away. But if we are doing analysis at the group level with some kind of aggregated data, like time-period averages, it makes sense to do transformations on the aggregated scale. The act of aggregation may also make the data “look” more linear and avoid the issue altogether.
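The order matters because transforming and averaging generally do not commute. Here is a small sketch with made-up numbers (the period names and values are purely illustrative) comparing the log of a period average with the average of the logged raw values.

```python
import math
from statistics import mean

# Hypothetical raw measurements grouped by time period.
periods = {
    "week1": [1.0, 10.0, 100.0],  # very spread out
    "week2": [5.0, 5.0, 5.0],     # no spread at all
}

results = {}
for name, values in periods.items():
    log_of_mean = math.log10(mean(values))              # aggregate, then transform
    mean_of_logs = mean(math.log10(v) for v in values)  # transform, then aggregate
    results[name] = (log_of_mean, mean_of_logs)
    print(f"{name}: log of mean = {log_of_mean:.3f}, mean of logs = {mean_of_logs:.3f}")
```

The two summaries agree when there is no spread within a period and drift apart as the spread grows, so the choice of where to transform should follow the level at which we want to make our inference.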

Data transformation can be an intimidating process. It’s hard to know when to get started (and when to stop). Hopefully the examples above give you a good starter guide to transforming your data when it’s giving you a pattern, but not quite the one you expect. Just remember that you need to update your interpretation of the relationship to match your transformations. Words matter!

Have a quantitative term or concept that mystifies you? Want it explained simply? Suggest a topic for next month →  @sastoudt

Title Image Credit: Bernard Spragg, CC0 1.0

