Posts tagged ‘dependence’

Polling 101: Correlation

Correlation and dependence are two key statistical concepts in polling.  Understanding the difference between these two concepts is critical if you are to understand the results of your labor.

Covariance

We’ve already defined the variance of a random variable as E[(X – µx)²].  This value measures the spread of the values X takes on about its mean.  Now let’s define the covariance between two random variables, X and Y, as:

    Cov(X,Y) = E[(X – µx)(Y – µy)]

Before we begin using this equation on actual data, it requires closer inspection.  X and Y represent the results of an endless series of experiments, each resulting in a number.  These numbers, on average, are centered around their respective means.  Therefore, (X – µx) and (Y – µy) vary similarly around 0.

Consider an X and Y that track each other closely.  That is, when X takes on its largest value, Y typically takes on its largest value too.  When X is relatively small, Y is too, and so on.  Similarly, (X – µx) and (Y – µy) track each other but, regardless of the sign of X and Y, they range between positive and negative values about 0.  Consequently, (X – µx)(Y – µy) would generate large positive values, as the product of either 2 large positive numbers or 2 large negative numbers.

Now consider an X and Y that are completely “out of sync”.  That is, when X is at its highest value, Y is typically at its lowest, and so on.  Now, (X – µx)(Y – µy) would be the product of a large positive value and a large negative value (or vice versa), resulting in a large negative value.

Finally, consider an X and Y whose relationship continuously shifts from perfectly in sync to perfectly out of sync.  (X – µx)(Y – µy) would sometimes be positive and sometimes be negative, so its expected value would tend toward 0.

The covariance between two random variables then is a measure of how synchronized, or correlated, they are from sample to sample.  Large positive covariances mean X and Y are highly synchronized.  Large negative values mean X and Y are highly out of synchronization.  A covariance near 0 means X and Y are uncorrelated.
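Here’s a minimal sketch of the three cases described above, using made-up samples rather than real polling data.  The `covariance` helper and the sample lists are illustrative assumptions; the helper simply averages (x – mean of x)(y – mean of y) over the samples.

```python
def covariance(xs, ys):
    """Population covariance: the mean of (x - mean_x) * (y - mean_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


# Made-up samples, chosen only to illustrate the three cases.
xs = [1, 2, 3, 4, 5, 6]
in_sync = [2, 4, 6, 8, 10, 12]       # large when xs is large
out_of_sync = [12, 10, 8, 6, 4, 2]   # large when xs is small
shifting = [2, 6, 12, 12, 6, 2]      # rises with xs in the first half, falls in the second

cov_in_sync = covariance(xs, in_sync)          # positive
cov_out_of_sync = covariance(xs, out_of_sync)  # negative
cov_shifting = covariance(xs, shifting)        # approximately 0

print(cov_in_sync, cov_out_of_sync, cov_shifting)
```

The sign of each result matches the in-sync, out-of-sync, and shifting cases in turn.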

To aid in the calculation of covariances, some simple algebra can produce an alternate equation:

    Cov(X,Y) = E[XY] – µxµy

This form is generally easier to calculate given sample data.
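For the curious, here is one way the algebra might go, assuming the alternate form meant here is the standard identity Cov(X,Y) = E[XY] – µxµy.  It uses only the linearity of expectation and the facts that E[X] = µx and E[Y] = µy.

```latex
\begin{align*}
\operatorname{Cov}(X,Y) &= E[(X-\mu_x)(Y-\mu_y)] \\
  &= E[XY - \mu_y X - \mu_x Y + \mu_x\mu_y] \\
  &= E[XY] - \mu_y E[X] - \mu_x E[Y] + \mu_x\mu_y \\
  &= E[XY] - \mu_x\mu_y - \mu_x\mu_y + \mu_x\mu_y \\
  &= E[XY] - \mu_x\mu_y
\end{align*}
```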

Correlation

As you could imagine, there’s an upper limit to how correlated 2 random variables can be.  It can be shown that Cov(X,Y) always lies between –σxσy and +σxσy, where σxσy is the product of X and Y’s standard deviations.  Therefore, the covariance can be normalized to produce unitless values between –1 and +1, giving us a consistent measure of correlation among diverse sets of random variables.  This correlation coefficient is:

    Corr(X,Y) = Cov(X,Y) / (σxσy)
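To see why the normalization matters, here’s a small sketch with hypothetical numbers (not survey data).  Changing the units of X from thousands of dollars to dollars multiplies the covariance by 1000 but leaves the correlation coefficient untouched.  The helper names `covariance` and `correlation` are assumptions made for this illustration.

```python
from math import sqrt


def covariance(xs, ys):
    """Population covariance: the mean of (x - mean_x) * (y - mean_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    sigma_x = sqrt(covariance(xs, xs))  # Cov(X,X) is the variance of X
    sigma_y = sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sigma_x * sigma_y)


# Hypothetical incomes, first in thousands of dollars, then in dollars.
income_thousands = [15, 35, 65, 100, 160, 250]
income_dollars = [1000 * v for v in income_thousands]
answers = [3, 3, 2, 2, 1, 1]  # hypothetical coded poll answers

cov_thousands = covariance(income_thousands, answers)
cov_dollars = covariance(income_dollars, answers)
corr_thousands = correlation(income_thousands, answers)
corr_dollars = correlation(income_dollars, answers)

print(cov_thousands, cov_dollars)    # differ by a factor of 1000
print(corr_thousands, corr_dollars)  # the same value, between -1 and +1
```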

An Example

To illustrate the use of correlation in data analysis, let’s look at the plausible (but totally fictitious) results of a poll conducted on 1000 Americans, asking them two simple questions:

  • What was your gross income last year? (X)
  • Do you support federal statutes implementing ‘Cap and Trade’ standards? (Y)

Do only rich people reject emissions regulations?  Calculating the correlation coefficient can give us a quantitative answer to the question.  Here’s the raw data:

Income              No (1)   In Part (2)   Yes (3)   Total
< $20k (1)              12            10       114     136
$20k-$50k (2)           61            23        95     179
$50k-$80k (3)           73            54        60     187
$80k-$120k (4)          82            54        12     148
$120k-$200k (5)        110            52         6     168
> $200k (6)            119            61         2     182
Total                  457           254       289    1000

So, 12 respondents, or 1.2%, reported an income of < $20,000 and answered ‘No’ to cap-and-trade standards.  457, or 45.7%, in total thought cap-and-trade is a bad idea.

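To start, we need to define our random variables, X and Y, and the numeric mapping we’ll use for them.  Let X be the answer to the income question, and Y be the answer to the cap-and-trade question.  As the table shows, we’ve assigned arbitrary numbers to each possible response to each question.  We can now make some basic calculations on the data, weighting each coded response by the fraction of respondents who gave it:

    µx = 3.579        µy = 1.832
    σx ≈ 1.687        σy ≈ 0.847
    E[XY] = 5.787

With these values, we can now calculate the covariance and correlation coefficient between X and Y.

    Cov(X,Y) = E[XY] – µxµy = 5.787 – (3.579)(1.832) ≈ –0.770
    Corr(X,Y) = Cov(X,Y) / (σxσy) ≈ –0.770 / ((1.687)(0.847)) ≈ –0.539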
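If you’d rather not do this arithmetic by hand, the calculation can be sketched in a few lines of Python.  This is only one way to do it: the `counts` layout and the `expect` helper are assumptions made for the illustration, and the numeric codes (1 to 6 for income, 1 to 3 for the answer) are the ones shown in the table.

```python
from math import sqrt

# counts[i][j] = respondents with income code i + 1 and answer code j + 1,
# copied from the table (rows: income brackets; columns: No, In Part, Yes).
counts = [
    [12, 10, 114],
    [61, 23, 95],
    [73, 54, 60],
    [82, 54, 12],
    [110, 52, 6],
    [119, 61, 2],
]
n = sum(sum(row) for row in counts)  # total number of respondents


def expect(f):
    """E[f(X, Y)], weighting each table cell by its share of respondents."""
    total = 0
    for i, row in enumerate(counts):
        for j, count in enumerate(row):
            total += f(i + 1, j + 1) * count
    return total / n


mu_x = expect(lambda x, y: x)
mu_y = expect(lambda x, y: y)
sigma_x = sqrt(expect(lambda x, y: x * x) - mu_x ** 2)
sigma_y = sqrt(expect(lambda x, y: y * y) - mu_y ** 2)
cov_xy = expect(lambda x, y: x * y) - mu_x * mu_y
corr_xy = cov_xy / (sigma_x * sigma_y)

print(f"mean X = {mu_x:.3f}, mean Y = {mu_y:.3f}")
print(f"sd X = {sigma_x:.3f}, sd Y = {sigma_y:.3f}")
print(f"covariance = {cov_xy:.3f}, correlation = {corr_xy:.3f}")
```

Because the codes are arbitrary, the covariance on its own is hard to interpret; the correlation coefficient is the number to read.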

While it may not be clear from the covariance alone, compared to the possible range of –1 to +1, the correlation coefficient shows us there’s a relatively strong negative correlation between these two poll results.  That is, the larger the answer given to the income question, the less likely the respondent is to support cap-and-trade legislation.  From our fictional survey, we have confirmed (to a degree) that rich people are less likely to support cap-and-trade than poor ones.

Although we’ve taken a critical step in understanding our polling data with the use of correlation, there’s still much more to consider.  To complete our toolset, we must look at population sampling, hypothesis testing, and the difference between correlation and dependence.

Polling 101: Dependence

In our ongoing series on polling theory and statistical analysis, we pick up where we left off from our discussion of probability, and introduce 2 new concepts: conditional probability and independence.

Conditional Probability

We’ve already seen that probability is defined as a value between 0 and 1 that represents the relative likelihood that a particular event will occur.  If we define a random variable X to denote the roll of a single fair die, and the event A to denote the outcome of an even number, then:

    Pr(A) = 3/6 = 1/2

since there are 6 possible outcomes, 3 of which yield an even number.

Conditional probabilities allow us to express the probability of one event, assuming another event has occurred.  For example, if we write Pr(B|A), we are describing the probability that event B occurred, given event A has occurred.  Using our original event A above, if we define event B as rolling a 2, then:

    Pr(B|A) = 1/3

since there are 3 possible outcomes (given the roll was an even number), 1 of which is a 2.

Finally, we can calculate the probability both A and B will occur.  This is the intersection of the events A and B, and its probability is calculated as:

    Pr(AB) = Pr(A) Pr(B|A)

So the probability of A and B is the probability of A, times the probability of B given A.  In our single die example, we can calculate the probability that the die is 2 and that the die is even as:

    Pr(AB) = Pr(A) Pr(B|A) = 1/2 × 1/3 = 1/6

This should make sense since there’s only 1 roll of the 6 possibilities that makes both events true.
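The single-die numbers above are easy to check by brute force.  Here’s a minimal sketch that counts outcomes with Python’s `fractions` module; the `pr` helper and its `given` parameter are names invented for this example.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}                     # one fair die
A = {roll for roll in outcomes if roll % 2 == 0}  # the roll is even
B = {2}                                           # the roll is a 2


def pr(event, given=None):
    """Probability of `event`, optionally restricted to the outcomes in `given`."""
    space = outcomes if given is None else given
    return Fraction(len(event & space), len(space))


pr_a = pr(A)                  # 1/2
pr_b_given_a = pr(B, given=A) # 1/3
pr_a_and_b = pr(A & B)        # 1/6

print(pr_a, pr_b_given_a, pr_a_and_b)
print(pr_a * pr_b_given_a == pr_a_and_b)  # the product rule holds
```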

Dependence

Sometimes, the probability of an event given another event is the same as the probability of that event alone.  When, for any two events M and N:

    Pr(M|N) = Pr(M)

we say that M and N are independent.  Returning to the probability of 2 intersecting events:

    Pr(AB) = Pr(A) Pr(B)

if and only if A and B are independent.

This doesn’t hold for our original events A and B (Pr(B|A) = 1/3, Pr(B) = 1/6), but let’s define a second random variable Y which corresponds to the roll of a second fair die, with event M denoting that the second die is even, and event N denoting that the second die rolls a 2 (reusing the letters M and N, now for these specific events).  So the probability of rolling two 2’s is:

    Pr(BN) = Pr(B) Pr(N) = 1/6 × 1/6 = 1/36

Here we used the “if and only if” clause in reverse: rather than inferring independence from the product rule, we start from independence and apply the product rule.  Common sense tells us that the roll of the second die is in no way influenced by the roll of the first, and therefore B and N are independent of each other.  As a result, we can simply multiply the individual probabilities.
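If common sense isn’t convincing enough, the two-dice case can be checked by listing all 36 outcomes.  The sketch below is one way to do that; the predicate functions `a`, `b`, and `n` are illustrative stand-ins for events A, B, and N.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (first die, second die) outcomes.
rolls = list(product(range(1, 7), repeat=2))


def pr(event):
    """Probability of an event, given as a predicate on (first, second)."""
    hits = sum(1 for roll in rolls if event(roll))
    return Fraction(hits, len(rolls))


def a(roll):  # A: the first die is even
    return roll[0] % 2 == 0


def b(roll):  # B: the first die is a 2
    return roll[0] == 2


def n(roll):  # N: the second die is a 2
    return roll[1] == 2


pr_b_and_n = pr(lambda roll: b(roll) and n(roll))
pr_a_and_b = pr(lambda roll: a(roll) and b(roll))

print(pr_b_and_n, pr(b) * pr(n))  # equal, so B and N are independent
print(pr_a_and_b, pr(a) * pr(b))  # differ, so A and B are not
```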

Event Unions

Since we’re talking about the intersection of 2 events, we should probably cover the union of them as well.  Where the intersection denotes events A and B both occurring, the union denotes A or B (or both) occurring.  To find the probability of a union of 2 events, the equation is:

    Pr(A+B) = Pr(A) + Pr(B) – Pr(AB)

To illustrate this relationship, consider a Venn diagram of events A and B:

[Figure: Venn diagram of the union of events A and B]

If we simply add Pr(A) and Pr(B), we include the area Pr(AB) twice, so we need to subtract one copy out.  If we define event C as the value of the first die being greater than or equal to 3, we can calculate:

    Pr(A+C) = Pr(A) + Pr(C) – Pr(AC) = 3/6 + 4/6 – 2/6 = 5/6

Of the 6 possible outcomes, only rolling a 1 fails to satisfy the event A+C.  Draw a Venn diagram to convince yourself.

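Finally, if Pr(AB) = 0, then Pr(A+B) = Pr(A) + Pr(B).  In this case, we call events A and B mutually exclusive.  Consider Pr(B+C).  A roll cannot be both a 2 and at least 3, so Pr(BC) = 0 and:

    Pr(B+C) = Pr(B) + Pr(C) = 1/6 + 4/6 = 5/6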
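Python’s built-in sets make these union rules easy to check for the single-die events used above.  This is just a sketch; the `pr` helper is an assumed name, and `|` and `&` are Python’s set union and intersection operators.

```python
from fractions import Fraction

outcomes = set(range(1, 7))  # one fair die
A = {2, 4, 6}                # the roll is even
B = {2}                      # the roll is a 2
C = {3, 4, 5, 6}             # the roll is at least 3


def pr(event):
    """Probability of an event as a fraction of the 6 equally likely rolls."""
    return Fraction(len(event), len(outcomes))


# General rule: subtract the overlap so it is not counted twice.
pr_a_or_c = pr(A | C)
print(pr_a_or_c, pr(A) + pr(C) - pr(A & C))

# B and C share no outcomes, so they are mutually exclusive.
pr_b_or_c = pr(B | C)
print(pr(B & C), pr_b_or_c, pr(B) + pr(C))
```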

Class dismissed.