Polling 101: Correlation

Correlation and dependence are two key statistical concepts in polling.  Understanding the difference between them is critical if you are to understand the results of your labor.

Covariance

We’ve already defined the variance of a random variable X as E[(X – µx)²].  This value measures how widely the values of X spread about the mean.  Now let’s define the covariance between two random variables, X and Y, as:

Cov(X, Y) = E[(X – µx)(Y – µy)]

Before we begin using this equation on actual data, it requires closer inspection.  X and Y represent the results of an endless series of experiments, each resulting in a number.  These numbers, on average, are centered around their respective means.  Therefore, (X – µx) and (Y – µy) vary similarly around 0.

Consider an X and Y that track each other closely.  That is, when X takes on its largest value, Y typically takes on its largest value too.  When X is relatively small, Y is too, and so on.  Similarly, (X – µx) and (Y – µy) track each other but, regardless of the signs of X and Y themselves, they range between positive and negative values about 0.  Consequently, (X – µx)(Y – µy) would generate large positive values, as the product of either two large positive numbers or two large negative numbers.

Now consider an X and Y that are completely “out of sync”.  That is, when X is at its highest value, Y is typically at its lowest, and so on.  Now, (X – µx)(Y – µy) would be the product of a large positive value and a large negative value (or vice versa), resulting in a large negative value.

Finally, consider an X and Y whose relationship continuously shifts from perfectly in sync to perfectly out of sync.  (X – µx)(Y – µy) would sometimes be positive and sometimes be negative, so the expected value of (X – µx)(Y – µy) would tend to 0.

The covariance between two random variables then is a measure of how synchronized, or correlated, they are from sample to sample.  Large positive covariances mean X and Y are highly synchronized.  Large negative values mean X and Y are highly out of synchronization.  A covariance near 0 means X and Y are uncorrelated.
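The three cases can be checked with a small simulation.  This is only a sketch on made-up data: the normally distributed samples, the noise level, and the helper name covariance are assumptions chosen for illustration, not part of any poll.

```python
import random

def covariance(xs, ys):
    """Estimate E[(X - mu_x)(Y - mu_y)] from paired samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

rng = random.Random(0)  # fixed seed so the run is repeatable
size = 10_000
xs = [rng.gauss(0, 1) for _ in range(size)]
noise = [rng.gauss(0, 0.1) for _ in range(size)]

in_sync = [x + e for x, e in zip(xs, noise)]        # Y rises and falls with X
out_of_sync = [-x + e for x, e in zip(xs, noise)]   # Y falls when X rises
unrelated = [rng.gauss(0, 1) for _ in range(size)]  # Y ignores X entirely

cov_in_sync = covariance(xs, in_sync)
cov_out_of_sync = covariance(xs, out_of_sync)
cov_unrelated = covariance(xs, unrelated)

print("in sync:    ", cov_in_sync)
print("out of sync:", cov_out_of_sync)
print("unrelated:  ", cov_unrelated)
```

The first value should come out clearly positive, the second clearly negative, and the third close to 0.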

To aid in the calculation of covariances, some simple algebra can produce an alternate equation:

Cov(X, Y) = E[XY] – µxµy

This form is generally easier to calculate given sample data.
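For readers who want the intermediate steps, the algebra can be written out as follows.  It assumes only that expectation is linear and that E[X] = µx and E[Y] = µy:

```latex
\begin{aligned}
\operatorname{Cov}(X,Y) &= E\big[(X-\mu_x)(Y-\mu_y)\big] \\
&= E\big[XY - \mu_x Y - \mu_y X + \mu_x\mu_y\big] \\
&= E[XY] - \mu_x E[Y] - \mu_y E[X] + \mu_x\mu_y \\
&= E[XY] - \mu_x\mu_y - \mu_x\mu_y + \mu_x\mu_y \\
&= E[XY] - \mu_x\mu_y
\end{aligned}
```

With poll data, E[XY] is simply the average of the products of each respondent's two answers, so the covariance needs only three averages.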

Correlation

As you might imagine, there’s an upper limit to how correlated two random variables can be.  It can be shown that Cov(X, Y) always lies between –σxσy and +σxσy, where σxσy is the product of X and Y’s standard deviations.  Therefore, the covariance can be normalized to produce a unitless value between –1 and +1, giving us a consistent measure of correlation among diverse sets of random variables.  This correlation coefficient is:

ρ(X, Y) = Cov(X, Y) / (σxσy)
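In code, the coefficient is the covariance divided by the product of the two standard deviations.  The sketch below shows why the normalization matters: changing the units of one variable rescales the covariance but leaves the coefficient alone.  The five data points and the function names are hypothetical, invented only for this illustration.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Average product of deviations from the means (divides by n)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Covariance normalized by the product of the standard deviations."""
    sigma_x = math.sqrt(covariance(xs, xs))  # Cov(X, X) is the variance of X
    sigma_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sigma_x * sigma_y)

# Hypothetical respondents: income in dollars and a 1-3 support score.
incomes_dollars = [18_000, 35_000, 62_000, 95_000, 150_000]
support_score = [3, 3, 2, 1, 1]
incomes_thousands = [v / 1000 for v in incomes_dollars]

cov_dollars = covariance(incomes_dollars, support_score)
cov_thousands = covariance(incomes_thousands, support_score)
rho_dollars = correlation(incomes_dollars, support_score)
rho_thousands = correlation(incomes_thousands, support_score)

print("covariance: ", cov_dollars, "vs", cov_thousands)
print("correlation:", rho_dollars, "vs", rho_thousands)
```

The two covariances differ by a factor of 1000, while the two correlation coefficients agree and fall between –1 and +1.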

An Example

To illustrate the use of correlation in data analysis, let’s look at the plausible (but totally fictitious) results of a poll conducted on 1000 Americans, asking them two simple questions:

  • What was your gross income last year? (X)
  • Do you support federal statutes implementing ‘Cap and Trade’ standards? (Y)

Do only rich people reject emissions regulations?  Calculating the correlation coefficient can give us a quantitative answer to the question.  Here’s the raw data:

Income              No (1)   In Part (2)   Yes (3)   Total
< $20k (1)              12            10       114     136
$20k-$50k (2)           61            23        95     179
$50k-$80k (3)           73            54        60     187
$80k-$120k (4)          82            54        12     148
$120k-$200k (5)        110            52         6     168
> $200k (6)            119            61         2     182
Total                  457           254       289    1000

So, 12 respondents, or 1.2%, reported an income of < $20,000 and answered ‘No’ to cap-and-trade standards.  457, or 45.7%, in total thought cap-and-trade is a bad idea.

To start, we need to define our random variables, X and Y, and the numeric mapping we’ll use for them.  Let X be the answer to the income question, and Y be the answer to the cap-and-trade question.  As the table shows, we’ve assigned arbitrary numbers to each possible response to each question.  We can now make some basic calculations on the data.

With these values, we can now calculate the covariance and correlation coefficient between X and Y.
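As a check on the arithmetic, the whole calculation can be sketched in a few lines of Python working directly from the table of counts.  The sketch assumes the 1 to 6 and 1 to 3 codes shown in the table and treats the 1000 responses as the full distribution (dividing by n rather than n – 1); the variable names are illustrative only.

```python
import math

# Rows: income brackets coded 1..6.  Columns: No (1), In Part (2), Yes (3).
counts = [
    [12, 10, 114],
    [61, 23, 95],
    [73, 54, 60],
    [82, 54, 12],
    [110, 52, 6],
    [119, 61, 2],
]
n = sum(sum(row) for row in counts)  # total respondents

def expectation(f):
    """E[f(X, Y)] under the empirical joint distribution in the table."""
    return sum(
        f(x, y) * count / n
        for x, row in enumerate(counts, start=1)
        for y, count in enumerate(row, start=1)
    )

mu_x = expectation(lambda x, y: x)
mu_y = expectation(lambda x, y: y)
var_x = expectation(lambda x, y: x * x) - mu_x ** 2
var_y = expectation(lambda x, y: y * y) - mu_y ** 2
cov_xy = expectation(lambda x, y: x * y) - mu_x * mu_y  # E[XY] - mu_x * mu_y
rho = cov_xy / math.sqrt(var_x * var_y)

print(f"mean X = {mu_x:.3f}, mean Y = {mu_y:.3f}")
print(f"sigma X = {math.sqrt(var_x):.3f}, sigma Y = {math.sqrt(var_y):.3f}")
print(f"Cov(X, Y) = {cov_xy:.3f}, correlation = {rho:.3f}")
```

On this table the sketch gives means of about 3.58 and 1.83, a covariance of about –0.77, and a correlation coefficient of about –0.54.  Small differences from a hand calculation would come from rounding or from dividing by n – 1 instead of n.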

While it may not be clear from the covariance alone, the correlation coefficient, compared to its possible range of –1 to +1, shows us there’s a relatively strong negative correlation between these two poll results.  That is, the higher the income bracket a respondent reported, the less likely that respondent was to support cap-and-trade legislation.  From our fictional survey, we have confirmed (to a degree) that rich people are less likely to support cap-and-trade than poor ones.

Although we’ve taken a critical step in understanding our polling data with the use of correlation, there’s still much more to consider.  To complete our toolset, we must look at population sampling, hypothesis testing, and the difference between correlation and dependence.
