Correlation and causation are discussed far more often than they are understood. We have all heard the phrase, “correlation does not imply causation.” But what does this mean statistically, and how are the two concepts connected to and distinct from each other? Let’s find out.

In statistics, two or more variables are said to be correlated if a change in the value of one (an increase or a decrease) tends to be accompanied by a change in the value of the others, even if in the opposite direction. A simple example is the pair of variables ‘income’ and ‘work hours.’ For a person paid by the hour, income rises as the number of hours worked increases, so these variables are positively correlated. On the other hand, consider product prices and purchasing power: as prices increase, a person’s purchasing power decreases, since they can afford fewer of those products. These two variables are correlated as well, albeit inversely.

Correlation is expressed as a number in statistics and describes the direction and size of the relationship between two or more variables. Correlation differs from causation, however: even though the values of two variables change in relation to each other, the change in one does not necessarily cause the change in the other.

A causal relationship, expressed as cause and effect, requires that one event or change brings about another. In simple cases, it is easy to see when a correlation is not causal. For example, an ACS publication by Stephen Johnson noted an interesting correlation between the number of fresh lemons imported from Mexico and the number of US highway fatalities. However, the movement or consumption of lemons did not cause the deaths.

In reality, identifying the correct relationship between events is not as simple. Statistical researchers have been working hard for decades to determine correlation, causation, and the degree to which they apply to two or more variables. Some complex cases under observation are:

- Whether vegan food leads to a longer and healthier life.
- If education level impacts the health of a person.
- The effects of fuel consumption and modern lifestyle on climate change.

Such problems are first explored to establish whether the variables are correlated at all. Once a correlation is established, efforts can be steered towards identifying whether the relationship is causal. The result can help governments introduce policies and laws that encourage the preferable outcome.

In statistics, a correlation coefficient (r) is used to describe the degree of relationship between two variables. (r) is a single number ranging between -1.0 and +1.0. The sign gives the direction of the relationship, and the magnitude gives its strength. If the value of (r) is 0, the variables have no linear relationship: changes in one are not accompanied by any consistent upward or downward movement in the other.

If the coefficient is negative, the variables have an inverse relationship with each other: when one increases, the other decreases. If the coefficient is positive, the variables move in the same direction: when one increases, the other increases as well.
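To make the sign of the coefficient concrete, here is a minimal sketch in Python that computes the Pearson correlation coefficient, the most common form of (r), for the two examples used earlier. The wage of 15 per hour, the budget of 100, and the helper name `pearson_r` are assumptions made up for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: hours worked and income at a fixed wage of 15 per hour.
hours = [10, 20, 30, 40, 50]
income = [h * 15 for h in hours]

# Hypothetical data: unit price and how many units a budget of 100 can buy.
prices = [1.0, 1.5, 2.0, 2.5, 3.0]
units_affordable = [100 / p for p in prices]

r_pos = pearson_r(hours, income)             # same direction: positive r
r_neg = pearson_r(prices, units_affordable)  # opposite direction: negative r

print(f"hours vs income:            r = {r_pos:.2f}")
print(f"price vs units affordable:  r = {r_neg:.2f}")
```

The first pair moves in lockstep, so (r) comes out at 1. The second pair gives roughly -0.95: the sign is negative because the variables move in opposite directions, and the magnitude falls short of 1 because the relationship is not a straight line.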

The correlation coefficient works best when the relationship between two variables is linear. Let us go back to an earlier example. If a person earns a fixed hourly wage, then income increases linearly with the number of hours worked, because it grows at a constant rate. But if the person charges an hourly fee that decreases with each hour worked, the relationship becomes non-linear, and the value of (r) no longer fully reflects how tightly the two variables are linked. In more strongly non-linear cases, such as a symmetric U-shaped relationship, (r) can be very close to zero even though one variable depends closely on the other. For such non-linear cases, the correlation coefficient is not the best tool to use.
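How much (r) is affected by non-linearity depends on the shape of the curve. The Python sketch below compares three hypothetical cases: a fixed hourly wage, an hourly fee that drops by a fixed amount with every hour worked, and a symmetric U-shaped relationship. All figures and the helper name `pearson_r` are invented for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

hours = list(range(1, 11))

# Case 1: fixed wage of 20 per hour (hypothetical), so income is linear in hours.
income_fixed = [20 * h for h in hours]

# Case 2: the fee starts at 20 and drops by 1.5 for each additional hour
# (hypothetical), so total income still rises, but more and more slowly.
income_declining = []
total = 0.0
for h in hours:
    total += 20 - 1.5 * (h - 1)
    income_declining.append(total)

# Case 3: a symmetric U-shape, y = x^2, where y is fully determined by x.
xs = list(range(-5, 6))
ys = [x ** 2 for x in xs]

r_fixed = pearson_r(hours, income_fixed)
r_declining = pearson_r(hours, income_declining)
r_u_shape = pearson_r(xs, ys)

print(f"fixed wage:    r = {r_fixed:.3f}")
print(f"declining fee: r = {r_declining:.3f}")
print(f"U-shape:       r = {r_u_shape:.3f}")
```

In this sketch the fixed wage gives an (r) of 1, and the declining fee still gives a high positive (r) because income keeps rising with every hour. The U-shaped case gives an (r) of zero even though one variable is an exact function of the other, which is why a low coefficient should not be read as "no relationship."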

The value of (r) does not always give us the correct picture. Even if the coefficient shows a relationship between two variables, it does not indicate that they are directly related to each other. Often, a third variable is driving both. For example, we may find that sales of ice cream and sales of warm clothes always move in opposite directions throughout the year. That does not mean that one affects the other. The ambient temperature and weather affect the sales of both products, in opposite directions. Taken alone, however, the correlation coefficient could lead us to believe that the two are directly related.
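The ice cream example can be simulated to show how a third variable produces a strong correlation on its own. The sketch below is a simulation under made-up assumptions: the temperature range, the sales formulas, and the noise levels are all hypothetical. It uses the standard first-order partial correlation formula to ask what remains of the relationship once temperature is held fixed.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical daily data for one year. Both sales series depend on
# temperature plus independent noise; neither depends on the other.
temperature = [random.uniform(0, 30) for _ in range(365)]
ice_cream = [10 * t + random.gauss(0, 40) for t in temperature]
warm_clothes = [400 - 10 * t + random.gauss(0, 40) for t in temperature]

r_sales = pearson_r(ice_cream, warm_clothes)
r_ice_temp = pearson_r(ice_cream, temperature)
r_clothes_temp = pearson_r(warm_clothes, temperature)

# Partial correlation between the two sales series, controlling for temperature.
r_partial = (r_sales - r_ice_temp * r_clothes_temp) / sqrt(
    (1 - r_ice_temp ** 2) * (1 - r_clothes_temp ** 2)
)

print(f"ice cream vs warm clothes:      r = {r_sales:.2f}")
print(f"same, controlling for weather:  r = {r_partial:.2f}")
```

The raw coefficient is strongly negative, but once temperature is accounted for the partial correlation should land near zero, because in this simulation the two sales figures are linked only through the weather.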

As mentioned earlier, correlation does not automatically imply causation, yet that is precisely the conclusion people often draw when they see a correlation between two variables. The most effective way to determine causality between two variables is through controlled studies.

In a controlled study, a population of people is divided into two groups that are as similar to each other as possible. The groups are then subjected to different treatments, and the results are compared. For example, in medical science, one group of people might receive a placebo while the other group receives an experimental or new drug. Because we cannot intentionally subject a group of people to harmful conditions, there are limits to what controlled studies can test. In such cases, researchers instead observe people’s responses to specific situations and track any changes in results over time. The results of these studies are added to already available information to bolster the evidence of causality.
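The logic of a controlled study can be sketched as a small simulation. The Python example below assumes random assignment as the way of making the two groups comparable, which is a common approach but an assumption here. The population size, the baseline health scores, and the drug effect of 5 points are all hypothetical.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is reproducible

TRUE_EFFECT = 5.0  # hypothetical improvement caused by the drug

# Hypothetical baseline health score for each of 1000 participants.
baseline = [random.gauss(50, 10) for _ in range(1000)]

# Random assignment: shuffle the participants, then split them in half.
random.shuffle(baseline)
placebo_group = baseline[:500]
drug_group = baseline[500:]

# The placebo leaves scores unchanged; the drug adds the true effect.
placebo_outcomes = list(placebo_group)
drug_outcomes = [score + TRUE_EFFECT for score in drug_group]

# How different the groups were before treatment, and after it.
baseline_gap = mean(drug_group) - mean(placebo_group)
estimated_effect = mean(drug_outcomes) - mean(placebo_outcomes)

print(f"gap between groups before treatment: {baseline_gap:.2f}")
print(f"estimated effect of the drug:        {estimated_effect:.2f}")
```

Because the groups start out nearly identical, the difference in outcomes can be attributed to the treatment rather than to some other variable. That comparison is what a correlation alone cannot provide.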

With more information at our disposal, it becomes easier to understand whether one event causes the other, whether the two are merely related, or whether their movements are purely coincidental.
