The Independence Assumption and its Foe, Spatial Correlation
When animals like these wolves travel in packs, spotting one individual means we’re more likely to spot another soon after. So how do we come up with a reliable population estimate in situations like these? (Image Credit: Eric Kilby, CC BY-SA 2.0, Image Cropped)
The thought of an ecologist may conjure the image of a scientist spending their time out in the field counting birds, looking for moss, studying mushrooms. Yet whilst field ecologists remain an integral part of modern ecology, the reality is that much of the discipline has come to rely on complex models. These are the processes which allow us to estimate figures like the 1 billion animals that have died in the recent Australian bushfires, or the potential spread of species further polewards as climate change warms our planet.
Yet a lot of these models rely on one big assumption – independence. For instance, let’s say we’re trying to link the effect of temperature to the abundance of a fish species in the USA. Our model may assume that our sites are independent of one another. However, if the data all comes from one series of interconnected lakes, which fish travel freely between, influencing each other, we could draw some seriously flawed conclusions that won’t apply to other lake systems, and that could lead to some very poor management decisions.
So today, let’s talk about independence, dependence, and their consequences. What does independence look like? What is the relationship between space- and time-dependent processes? How can we get around this assumption if we have reason to believe our process of interest has dependence?
What does independent data look like?
Independence is a modeller’s best friend. The independence assumption allows us to borrow information across observations, decompose a complicated likelihood into a nice and clean product, and eliminate lots of pesky parameters that otherwise would have to be estimated. In an ecological setting, this simplifying assumption might take the form of assuming that sites where we collect data about species occurrence or abundance are independent from one another or that the locations of individuals are independent of one another.
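To make the “nice and clean product” concrete, here is a generic sketch of what independence buys us. The symbols are assumptions for illustration only: y_1, ..., y_n are the observations, theta is the set of model parameters, and f is whatever distribution the model uses for a single observation.

```latex
% Without independence we need the full joint distribution of all n observations.
% With independence it factorises into one term per observation:
L(\theta) = f(y_1, \dots, y_n \mid \theta) = \prod_{i=1}^{n} f(y_i \mid \theta)

% Taking logs turns the product into a sum, which is what we usually maximise:
\log L(\theta) = \sum_{i=1}^{n} \log f(y_i \mid \theta)
```

Each term depends on one observation only, so there are no parameters describing how pairs of observations relate to one another.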
If individuals were distributed independently across space, their locations might look like this:
You might have modelled this data with a homogeneous Poisson Process.
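If you want to generate a pattern like this yourself, here is a minimal sketch in Python with NumPy. The function name, the intensity of 100 individuals per unit area, and the unit-square study area are all assumptions made for illustration.

```python
import numpy as np

def simulate_homogeneous_poisson(intensity, width=1.0, height=1.0, seed=None):
    """Simulate a homogeneous Poisson process on a width x height rectangle."""
    rng = np.random.default_rng(seed)
    # The total number of individuals is Poisson with mean intensity * area ...
    n_points = rng.poisson(intensity * width * height)
    # ... and, given that number, each location is independent and uniform.
    x = rng.uniform(0.0, width, size=n_points)
    y = rng.uniform(0.0, height, size=n_points)
    return np.column_stack([x, y])

# Hypothetical example: on average 100 individuals per unit area.
points = simulate_homogeneous_poisson(intensity=100, seed=1)
print(points.shape)  # (number of individuals, 2)
```

Knowing where one individual is tells you nothing about where the others are, which is exactly the independence assumption.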
However, there are many ecological reasons for this assumption to be broken. Individuals might cluster, e.g. if they travel in packs. If you spot one individual, you are more likely to spot another nearby. If you assumed independence you might underestimate the total number of individuals in the area.
They might “repel” one another since they compete for resources. If you spot one individual, you are less likely to spot another nearby. If you assumed independence you might overestimate the total number of individuals in the area.
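One rough way to see clustering in numbers is to lay a grid over the study area, count individuals per cell, and compare the variance of those counts with their mean. Independent locations give a ratio near 1, clustering pushes it above 1, and repulsion pulls it below 1. The sketch below is only an illustration: the pack sizes, the spread within a pack, and the 10 x 10 grid are made-up values, and this index is a simple diagnostic rather than a way to correct a population estimate.

```python
import numpy as np

def dispersion_index(points, n_bins=10):
    """Variance-to-mean ratio of counts on an n_bins x n_bins grid over the unit square."""
    counts, _, _ = np.histogram2d(
        points[:, 0], points[:, 1], bins=n_bins, range=[[0, 1], [0, 1]]
    )
    return counts.var(ddof=1) / counts.mean()

rng = np.random.default_rng(42)

# 500 individuals placed independently and uniformly.
independent = rng.uniform(0, 1, size=(500, 2))

# 500 individuals in 50 hypothetical packs of 10, each scattered around a pack centre.
centres = rng.uniform(0, 1, size=(50, 2))
offsets = rng.normal(0, 0.02, size=(50, 10, 2))
clustered = (centres[:, None, :] + offsets).reshape(-1, 2) % 1.0  # wrap into the unit square

independent_index = dispersion_index(independent)
clustered_index = dispersion_index(clustered)
print(f"independent: {independent_index:.2f}, clustered: {clustered_index:.2f}")
```

Both patterns contain the same number of individuals. Only their arrangement differs, and that is what the ratio picks up.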
If the non-independence is induced by a similar response to an aspect of their terrain, or some other property we can measure, we can loosen our assumption to conditional independence. This means that once we account for a bunch of stuff (more formally, covariates), independence reigns. You may have used an inhomogeneous Poisson Process to handle this. For example, without modelling the relationship between abundance and tree cover, we might have mistaken a shared response to tree cover for dependence between individuals.
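Here is a minimal sketch of an inhomogeneous Poisson process, simulated by thinning: generate candidate locations at the highest intensity anywhere in the area, then keep each one with probability proportional to the local intensity. Everything specific here is an assumption for illustration: the `tree_cover` covariate (rising from 0 in the west to 1 in the east), the log-linear form of the intensity, and the coefficients `beta0` and `beta1`.

```python
import numpy as np

def tree_cover(x, y):
    """Hypothetical covariate: tree cover rises from 0 (west edge) to 1 (east edge)."""
    return x

def intensity(x, y, beta0=5.0, beta1=2.0):
    """Assumed log-linear intensity: expected individuals per unit area at (x, y)."""
    return np.exp(beta0 + beta1 * tree_cover(x, y))

def simulate_inhomogeneous_poisson(max_intensity, seed=None):
    """Simulate on the unit square by thinning a homogeneous Poisson process."""
    rng = np.random.default_rng(seed)
    # Candidates come from a homogeneous process at the maximum intensity (area = 1).
    n_candidates = rng.poisson(max_intensity)
    x = rng.uniform(0, 1, size=n_candidates)
    y = rng.uniform(0, 1, size=n_candidates)
    # Keep each candidate with probability intensity / max_intensity.
    keep = rng.uniform(0, 1, size=n_candidates) < intensity(x, y) / max_intensity
    return np.column_stack([x[keep], y[keep]])

# The assumed intensity peaks where tree cover equals 1, i.e. at x = 1.
lambda_max = float(intensity(1.0, 0.0))
points = simulate_inhomogeneous_poisson(lambda_max, seed=7)

west = int((points[:, 0] < 0.5).sum())
east = int((points[:, 0] >= 0.5).sum())
print(f"west half: {west}, east half: {east}")
```

The individuals bunch up in the east, but only because they all respond to tree cover. Given the covariate, their locations are still independent of one another.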
But what if the dependence is not due to shared relationships to covariates but instead is a function of the locations of the sites or individuals themselves? This type of correlation is spatial in nature and is a bit trickier to account for.
Accounting For Spatial Correlation
Without making any assumptions about how sites or individuals are related to one another, we would have to treat each differently. That would be a lot of parameters to estimate! However, we can exchange one assumption (independence) for another one (the form of the spatial correlation).
But before we think too hard about space, let’s think about time. In data collected over time, correlation occurs between observations. The temperature tomorrow is correlated with the temperature today: if it’s hot today, tomorrow is more likely to be similarly hot than to drop dramatically to an icy temperature. The temperature tomorrow is correlated with yesterday’s temperature too, but we assume that this relationship is weaker due to the longer period of time between the two observations. Add an extra dimension, and spatial correlation works the same way. The relationship between two locations is impacted by the distance between them. We assume locations closer to one another are more strongly correlated.
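The temperature example can be sketched with a first-order autoregressive, or AR(1), series, where each value is a fraction of the previous value plus fresh noise. This is one simple model of temporal correlation, chosen here for illustration, and the coefficient `phi = 0.8` is an arbitrary assumption.

```python
import numpy as np

def simulate_ar1(n, phi, seed=None):
    """Simulate an AR(1) series: value[t] = phi * value[t-1] + noise[t]."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(n)
    series = np.empty(n)
    series[0] = noise[0]
    for t in range(1, n):
        series[t] = phi * series[t - 1] + noise[t]
    return series

def autocorrelation(series, lag):
    """Sample correlation between the series and itself shifted by `lag` steps."""
    return np.corrcoef(series[:-lag], series[lag:])[0, 1]

# Hypothetical daily temperature anomalies.
temps = simulate_ar1(n=5000, phi=0.8, seed=3)
lag1 = autocorrelation(temps, 1)  # today vs tomorrow
lag2 = autocorrelation(temps, 2)  # yesterday vs tomorrow
print(f"lag 1: {lag1:.2f}, lag 2: {lag2:.2f}")
```

In theory the lag 1 correlation of this model is `phi` and the lag 2 correlation is `phi` squared, so the correlation shrinks as the gap in time grows.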
There are many ways to model spatial correlation. These approaches often smooth the actual spatial relationship so that we can represent the complicated correlation by only a few more parameters. We can think of the parameters as controlling how quickly the dependency drops off as distance increases. Spatial random fields are often the workhorse for this type of approach.
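As a sketch of how “only a few more parameters” can describe the correlation between every pair of sites, here is an exponential correlation function with a single range parameter, used to simulate one realisation of a Gaussian random field. The exponential form is just one common choice, and the 200 random sites, the 10 x 10 study area, and `range_param = 2.0` are assumptions for illustration.

```python
import numpy as np

def exponential_correlation(distance, range_param):
    """Correlation that decays with distance; a larger range_param means slower decay."""
    return np.exp(-np.asarray(distance) / range_param)

rng = np.random.default_rng(0)

# Hypothetical study design: 200 sites scattered over a 10 x 10 area.
sites = rng.uniform(0, 10, size=(200, 2))

# Pairwise Euclidean distances between all sites.
diffs = sites[:, None, :] - sites[None, :, :]
distances = np.sqrt((diffs ** 2).sum(axis=-1))

# One parameter describes the correlation between every pair of sites.
corr = exponential_correlation(distances, range_param=2.0)

# Draw one spatially correlated field via the Cholesky factor of the correlation
# matrix; the tiny diagonal "jitter" is only there for numerical stability.
chol = np.linalg.cholesky(corr + 1e-8 * np.eye(len(sites)))
field = chol @ rng.standard_normal(len(sites))

print(exponential_correlation(1.0, 2.0), exponential_correlation(5.0, 2.0))
```

With 200 sites there are 19,900 distinct pairs, yet one range parameter (plus a variance, in a full model) sets all of their correlations.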
For count and continuous data it’s easier to conceptualize spatial correlation, and perhaps even spot it in a plot, but for binary data, intuition may be a bit lacking. A new paper in Methods in Ecology and Evolution tackles the binary case. The authors introduce ecologists to the lorelogram, a graphical tool, to provide a way to assess dependency in binary data, such as species distribution data. There is even an R package of the same name to make it easy to get going, although sadly there is no gratuitous capital R in the middle of the package name.
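To give a flavour of the idea, here is a rough Python sketch of an empirical lorelogram for a single binary series: for each lag, pair up observations that far apart, build a 2 x 2 table of the pairs, and take the log odds ratio. This is not the R package’s implementation. The function names, the simulated “sticky” detection series, and the 0.5 added to each cell (to avoid dividing by zero) are assumptions made for illustration.

```python
import numpy as np

def log_odds_ratio_at_lag(y, lag):
    """Log odds ratio between binary observations that are `lag` steps apart."""
    first, second = y[:-lag], y[lag:]
    # Counts of the four possible pairs, plus 0.5 so no cell is ever zero.
    n11 = np.sum((first == 1) & (second == 1)) + 0.5
    n10 = np.sum((first == 1) & (second == 0)) + 0.5
    n01 = np.sum((first == 0) & (second == 1)) + 0.5
    n00 = np.sum((first == 0) & (second == 0)) + 0.5
    return float(np.log((n11 * n00) / (n10 * n01)))

def simulate_sticky_series(n, stay_prob, rng):
    """Binary series that repeats its previous value with probability stay_prob."""
    y = np.empty(n, dtype=int)
    y[0] = rng.integers(0, 2)
    for t in range(1, n):
        y[t] = y[t - 1] if rng.random() < stay_prob else 1 - y[t - 1]
    return y

rng = np.random.default_rng(11)
# Hypothetical detection (1) / non-detection (0) record with strong dependence.
detections = simulate_sticky_series(n=5000, stay_prob=0.9, rng=rng)

lags = np.arange(1, 16)
lorelogram = [log_odds_ratio_at_lag(detections, int(lag)) for lag in lags]
for lag, value in zip(lags, lorelogram):
    print(int(lag), round(value, 2))
```

A value near zero means observations that far apart behave as if independent, while large positive values at short lags signal that a detection makes another detection soon after more likely.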
Have a quantitative term or concept that mystifies you? Want it explained simply? Suggest a topic for next month → @sastoudt