Posts tagged ‘statistics’

Polling 101: Visualizing Correlation

We’ve seen that we can quantify how two random processes, or random variables, synchronize with each other by calculating the correlation between them.  A correlation coefficient near +1 means the two random variables are positively synchronized, or correlated.  A coefficient near -1 means they are negatively correlated.  A coefficient near 0 means the two random variables are uncorrelated and don’t track each other at all.  In our hypothetical poll, we calculated the coefficient between two polling questions as -0.55, a relatively strong negative correlation.

Plotting Correlated Random Variables

While calculating a single number to assess correlation is a handy method for quickly comparing two random variables, it’s also helpful to plot the data on a graph and inspect the relationship between the two variables.  A standard ‘scatter plot’ works fine for most statistical analyses, but we find that more sophisticated methods are required to visualize polling data.

Consider an investigation that studies the link between stress and high blood pressure.  If you were to interview 20 volunteers and measure the level of stress in their lives along with their average blood pressure, you could use the data to create the following plot.

[Figure: scatter plot of stress level (x) against blood pressure (y), with line of best fit]

Here, each point represents a single test subject, with its x value representing their level of stress and y value representing their blood pressure.  Even without calculating the correlation coefficient, we can easily see that there’s a relatively strong correlation between the two.  The plot also shows the line of best fit, calculated from the data points (we’ll show how to determine that line in a later post).

But what would we see if we drew a similar scatter plot of our fictitious polling data, plotting respondents’ support of Cap-and-Trade legislation against their income bracket?

[Figure: scatter plot of Cap-and-Trade support against income bracket]

Plots created with gnuplot.

3D Surface Plots

The simple scatter plot above is not very informative.  Because we have only 6×3=18 possible outcomes, the individual responses simply stack up on these 18 points.  If we introduce a 3rd axis, however, we can show how often each possible outcome was chosen.

[Figure: 3D plot of response frequency (z) for each income bracket and Cap-and-Trade answer]

That’s a bit better.  Here we see the 18 possible results as defined by their x and y coordinates, and the frequency of respondents who selected them by the z axis.  If we look closely enough, we might see a ridge running from (Income=1, Cap-and-Trade=3) to (Income=6, Cap-and-Trade=1).  This is the 3-dimensional equivalent of that cluster of points around a straight line and indicates correlation.
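If the ridge is hard to spot by eye, one rough numerical stand-in is to compute the average Cap-and-Trade answer within each income bracket and watch how it moves.  The Python sketch below does that with the counts from our fictitious poll; it is not how the gnuplot figures were produced, and the function name mean_answer_by_income is just an illustrative choice.

```python
# Counts from the fictitious poll: rows are income brackets 1..6,
# columns are the Cap-and-Trade answers No (1), In Part (2), Yes (3).
counts = [
    [12, 10, 114],
    [61, 23, 95],
    [73, 54, 60],
    [82, 54, 12],
    [110, 52, 6],
    [119, 61, 2],
]

def mean_answer_by_income(table):
    """Average Cap-and-Trade answer (1..3) within each income bracket."""
    means = []
    for row in table:
        total = sum(row)
        weighted = sum(answer * n for answer, n in enumerate(row, start=1))
        means.append(weighted / total)
    return means

ridge = mean_answer_by_income(counts)
for income, avg in enumerate(ridge, start=1):
    print(f"income bracket {income}: average answer {avg:.2f}")
```

The averages fall steadily from 2.75 in the lowest bracket toward about 1.36 in the highest, which is the same downhill trend the ridge shows in the plot.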

To further aid visual analysis, we must fill in the gaps, or interpolate, between the given data points with approximate values.  We can then color-code the surface to further illustrate the relationship between the data points.

[Figures: two interpolated, color-coded surface plots of the poll results, side by side]

The saddle shape of these plots shows the ridge that indicates a fairly strong correlation between these 2 results.  Ultimately, we can turn the right-hand surface plot into a 2-dimensional map and simply follow the colors.

[Figure: 2-dimensional color map of the poll results]

So, the rule for interpreting map plots that compare 2 result sets is simple.  If you can draw a straight line across the plot by touching only hot colors (i.e. those that represent high-frequency result pairs), you’ve identified correlated events.  This technique, combined with the calculation of the correlation coefficient, will give you a good clue that the 2 questions are correlated.

Polling 101: Correlation

Correlation and dependence are two key statistical concepts in polling.  Understanding the differences between these two concepts is critical if you are to understand the results of your labor.

Covariance

We’ve already defined the variance of a random variable as E[(X – µx)²].  This value measures the spread of the values X takes on about its mean.  Now let’s define the Covariance between two random variables, X and Y, as:

Cov(X,Y) = E[(X – µx)(Y – µy)]

Equations by mathURL.com.

Before we begin using this equation on actual data, it requires closer inspection.  X and Y represent the results of an endless series of experiments, each resulting in a number.  These numbers, on average, are centered around their respective means.  Therefore, (X – µx) and (Y – µy) vary similarly around 0.

Consider an X and Y that track each other closely.  That is, when X takes on its largest value, Y typically takes on its largest value too.  When X is relatively small, Y is too, and so on.  Similarly, (X – µx) and (Y – µy) track each other but, regardless of the sign of X and Y, they range between positive and negative values about 0.  Consequently, (X – µx) (Y – µy) would generate large positive values, as the product of either 2 large positive numbers or 2 large negative numbers.

Now consider an X and Y that are completely “out of sync”.  That is, when X is at its highest value, Y is typically at its lowest, and so on.  Now, (X – µx) (Y – µy) would be the product of a large positive value and a large negative value (or vice versa), resulting in a large negative value.

Finally, consider an X and Y whose relationship continuously shifts from perfectly in sync to perfectly out of sync.  (X – µx) (Y – µy) would sometimes be positive and sometimes be negative, so the expected value of (X – µx) (Y – µy) would tend to 0.

The covariance between two random variables then is a measure of how synchronized, or correlated, they are from sample to sample.  Large positive covariances mean X and Y are highly synchronized.  Large negative values mean X and Y are highly out of synchronization.  A covariance near 0 means X and Y are uncorrelated.
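To make the three cases concrete, here is a minimal Python sketch that applies the definition to three tiny made-up samples.  The data and the function name covariance are invented for illustration, and the function divides by n (the population form) rather than n-1.

```python
def covariance(xs, ys):
    """Population covariance: the average of (x - mean_x) * (y - mean_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

xs = [1, 2, 3, 4, 5]
in_sync = [1, 2, 3, 4, 5]        # rises whenever xs rises
out_of_sync = [5, 4, 3, 2, 1]    # falls whenever xs rises
no_pattern = [2, 1, 0, 1, 2]     # high at both ends, low in the middle

cov_in_sync = covariance(xs, in_sync)          # 2.0
cov_out_of_sync = covariance(xs, out_of_sync)  # -2.0
cov_no_pattern = covariance(xs, no_pattern)    # approximately 0

print(cov_in_sync, cov_out_of_sync, cov_no_pattern)
```

Note that the third sample clearly follows a pattern (a V shape), yet its covariance is essentially 0.  Covariance only picks up straight-line synchronization.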

To aid in the calculation of covariances, some simple algebra can produce an alternate equation:

Cov(X,Y) = E[XY] – µxµy

This form is generally easier to calculate given sample data.
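For anyone who wants to see the algebra, here is a sketch of the steps, assuming the alternate form in question is the standard identity Cov(X,Y) = E[XY] – µxµy.  It only uses the facts that expectation is linear and that E[X] = µx and E[Y] = µy.

```latex
\begin{aligned}
\mathrm{Cov}(X,Y) &= E\big[(X-\mu_x)(Y-\mu_y)\big] \\
                  &= E\big[XY - \mu_y X - \mu_x Y + \mu_x\mu_y\big] \\
                  &= E[XY] - \mu_y E[X] - \mu_x E[Y] + \mu_x\mu_y \\
                  &= E[XY] - \mu_x\mu_y
\end{aligned}
```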

Correlation

As you might imagine, there’s a limit to how correlated 2 random variables can be.  It can be shown that Cov(X,Y) always lies between –σxσy and +σxσy, where σxσy is the product of X and Y’s standard deviations.  Therefore, the covariance can be normalized to produce unitless values between -1 and +1, giving us a consistent measure of correlation among diverse sets of random variables.  This correlation coefficient is:

ρ = Cov(X,Y) / (σxσy)
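A short Python sketch can show why the normalization matters.  The height and weight numbers below are made up purely for illustration, and the function name correlation is our own; the point is only that changing the units of X rescales the covariance but leaves the correlation coefficient alone.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient: covariance divided by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sigma_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sigma_x * sigma_y)

# Invented sample data, not from any real survey.
heights_m = [1.5, 1.6, 1.7, 1.8, 1.9]
weights_kg = [52.0, 61.0, 63.0, 74.0, 80.0]
heights_cm = [h * 100 for h in heights_m]  # same data, different unit

r_m = correlation(heights_m, weights_kg)
r_cm = correlation(heights_cm, weights_kg)
print(f"{r_m:.3f} {r_cm:.3f}")
```

Both calls return the same value (up to floating-point rounding), which is what lets us compare correlations across unrelated pairs of variables.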

An Example

To illustrate the use of correlation in data analysis, let’s look at the plausible (but totally fictitious) results of a poll conducted on 1000 Americans, asking them two simple questions:

  • What was your gross income last year? (X)
  • Do you support federal statutes implementing ‘Cap and Trade’ standards? (Y)

Do only rich people reject emissions regulations?  Calculating the correlation coefficient can give us a quantitative answer to the question.  Here’s the raw data:

Income            No (1)   In Part (2)   Yes (3)   Total
< $20k (1)            12            10       114     136
$20k-$50k (2)         61            23        95     179
$50k-$80k (3)         73            54        60     187
$80k-$120k (4)        82            54        12     148
$120k-$200k (5)      110            52         6     168
> $200k (6)          119            61         2     182
Total                457           254       289    1000

So, 12 respondents, or 1.2%, reported an income of < $20,000 and answered ‘No’ to cap-and-trade standards.  457, or 45.7%, in total thought cap-and-trade is a bad idea.

To start, we need to define our random variables, X and Y, and the numeric mapping we’ll use for them.  Let X be the answer to the income question, and Y be the answer to the cap-and-trade question.  As the table shows, we’ve assigned arbitrary numbers to each possible response to each question.  We can now make some basic calculations on the data.

With these values, we can now calculate the covariance and correlation coefficient between X and Y.
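These calculations can be reproduced directly from the table.  The Python sketch below is one way to do it, assuming each cell’s share of the 1000 responses is treated as the probability of that (X, Y) pair and using the population (divide by n) form throughout; the function name poll_statistics is our own.

```python
import math

# Rows: income brackets X = 1..6.  Columns: answers Y = 1 (No), 2 (In Part), 3 (Yes).
counts = [
    [12, 10, 114],
    [61, 23, 95],
    [73, 54, 60],
    [82, 54, 12],
    [110, 52, 6],
    [119, 61, 2],
]

def poll_statistics(table):
    """Means, standard deviations, covariance and correlation for a count table."""
    n = sum(sum(row) for row in table)
    e_x = e_y = e_xx = e_yy = e_xy = 0.0
    for x, row in enumerate(table, start=1):
        for y, count in enumerate(row, start=1):
            p = count / n              # relative frequency of the pair (x, y)
            e_x += x * p
            e_y += y * p
            e_xx += x * x * p
            e_yy += y * y * p
            e_xy += x * y * p
    sigma_x = math.sqrt(e_xx - e_x ** 2)
    sigma_y = math.sqrt(e_yy - e_y ** 2)
    cov = e_xy - e_x * e_y             # E[XY] minus the product of the means
    return {
        "mean_x": e_x, "mean_y": e_y,
        "sigma_x": sigma_x, "sigma_y": sigma_y,
        "cov": cov, "rho": cov / (sigma_x * sigma_y),
    }

stats = poll_statistics(counts)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Run as written, this gives a covariance of about -0.77 and a correlation coefficient of roughly -0.54.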

While it may not be clear from the covariance alone, compared to the possible range of -1 to +1, the correlation coefficient shows us there’s a relatively strong negative correlation between these two poll results.  That is, the larger the answer given to the income question, the less likely the respondent is to support cap-and-trade legislation.  From our fictional survey, we have confirmed (to a degree) that rich people are less likely to support cap-and-trade than poor ones.

Although we’ve taken a critical step in understanding our polling data with the use of correlation, there’s still much more to consider.  To complete our toolset, we must look at population sampling, hypothesis testing, and the difference between correlation and dependence.

Polling 101: Dependence

In our ongoing series on polling theory and statistical analysis, we pick up where we left off from our discussion of probability, and introduce 2 new concepts: conditional probability and independence.

Conditional Probability

We’ve already seen that probability is defined as a value between 0 and 1 that represents the relative likelihood that a particular event will occur.  If we define a random variable X to denote the roll of a single fair die, and the event A to denote the outcome of an even number, then:

Pr(A) = 3/6 = 1/2

formulas courtesy of mathURL.com

since there are 6 possible outcomes, 3 of which yield an even number.

Conditional probabilities allow us to express the probability of one event, assuming another event has occurred.  For example, if we write Pr(B|A), we are describing the probability that event B occurred, given event A has occurred.  Using our original event A above, if we define event B as rolling a 2, then:

Pr(B|A) = 1/3

since there are 3 possible outcomes (given the roll was an even number), 1 of which is a 2.

Finally, we can calculate the probability that both A and B will occur.  This is the intersection of the events A and B, and its probability is calculated as:

Pr(AB) = Pr(A) Pr(B|A)

So the probability of A and B is the probability of A, times the probability of B given A.  In our single die example, we can calculate the probability that the die is 2 and that the die is even as:

Pr(AB) = Pr(A) Pr(B|A) = 1/2 × 1/3 = 1/6

This should make sense since there’s only 1 roll of the 6 possibilities that makes both events true.
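The single-die numbers are easy to check by brute force.  Here is a minimal Python sketch that simply counts outcomes; the helper pr and its given argument are our own illustrative names, and the approach only works because every face is equally likely.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}               # one fair die
A = {x for x in outcomes if x % 2 == 0}     # event A: the roll is even
B = {2}                                     # event B: the roll is a 2

def pr(event, given=None):
    """Probability of `event`, optionally restricted to the outcomes in `given`."""
    space = outcomes if given is None else given
    return Fraction(len(event & space), len(space))

pr_a = pr(A)                    # 1/2
pr_b_given_a = pr(B, given=A)   # 1/3
pr_ab = pr(A & B)               # 1/6

print(pr_a, pr_b_given_a, pr_ab)
```

Counting the intersection directly gives the same 1/6 as multiplying Pr(A) by Pr(B|A).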

Dependence

Sometimes, the probability of an event given another event is the same as the probability of that event alone.  When, for two events M and N:

Pr(M|N) = Pr(M)

we say that M and N are independent.  Returning to the probability of 2 intersecting events:

Pr(AB) = Pr(A) Pr(B)

if and only if A and B are independent.

This doesn’t hold for our original events A and B (Pr(B|A) = 1/3, Pr(B) = 1/6), but let’s define a second random variable Y which corresponds to the roll of a second fair die, with event M denoting that the second die is even, and event N denoting that the second die rolls a 2.  So the probability of rolling two 2’s is:

Pr(BN) = Pr(B) Pr(N) = 1/6 × 1/6 = 1/36

Here we used the reciprocity of the “if and only if” clause; in other words, the rule can be reversed if necessary.  Common sense tells us that the roll of the second die is in no way influenced by the roll of the first, and therefore they are independent of each other.  As a result, we can simply multiply the individual probabilities.
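If common sense isn’t convincing enough, the two-dice case can also be checked by listing all 36 rolls.  The Python sketch below does so; the helper pr and the lambda-based event definitions are just one illustrative way to write it.

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))   # 36 equally likely (first, second) pairs

def pr(predicate):
    """Fraction of the 36 rolls for which predicate(first, second) is true."""
    hits = sum(1 for first, second in rolls if predicate(first, second))
    return Fraction(hits, len(rolls))

pr_b = pr(lambda first, second: first == 2)                    # first die is a 2
pr_n = pr(lambda first, second: second == 2)                   # second die is a 2
pr_bn = pr(lambda first, second: first == 2 and second == 2)   # both dice are 2

pr_a = pr(lambda first, second: first % 2 == 0)                # first die is even
pr_ab = pr(lambda first, second: first % 2 == 0 and first == 2)

print(pr_bn == pr_b * pr_n)   # independent events: the product rule holds
print(pr_ab == pr_a * pr_b)   # A and B on the same die: it does not
```

The first comparison prints True (1/36 on both sides), while the second prints False, since Pr(AB) is 1/6 but Pr(A) times Pr(B) is only 1/12.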

Event Unions

Since we’re talking about the intersection of 2 events, we should probably cover the union of them as well.  Where the intersection denotes events A and B occurring, union denotes A or B occurring.  To find the probability of a union of 2 events, the equation is:

Pr(A+B) = Pr(A) + Pr(B) – Pr(AB)

To illustrate this relationship, consider a Venn diagram of events A and B:

[Figure: Venn diagram of events A and B, overlapping in the region AB]

If we simply add Pr(A) and Pr(B), we’ll include the area Pr(AB) twice, so we need to subtract one out.  If we define event C as the value of the first die being greater than or equal to 3, we can calculate:

Pr(A+C) = Pr(A) + Pr(C) – Pr(AC) = 3/6 + 4/6 – 2/6 = 5/6

Of the 6 possible outcomes, only rolling a 1 fails to satisfy the event A+C.  Draw a Venn diagram to convince yourself.

Finally, if Pr(AB) = 0, then Pr(A+B) = Pr(A) + Pr(B).  In this case, we call events A and B mutually exclusive.  Consider Pr(B+C), where no roll can be both a 2 and at least 3:

Pr(B+C) = Pr(B) + Pr(C) = 1/6 + 4/6 = 5/6
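The union rule can be checked the same way, using Python sets as stand-ins for events on a single fair die.  This is only a sketch under the equally-likely assumption, and the helper names pr and pr_union are our own.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}   # one fair die
A = {2, 4, 6}                   # the roll is even
B = {2}                         # the roll is a 2
C = {3, 4, 5, 6}                # the roll is greater than or equal to 3

def pr(event):
    """Probability of an event when every face is equally likely."""
    return Fraction(len(event), len(outcomes))

def pr_union(e1, e2):
    """Pr(E1+E2) = Pr(E1) + Pr(E2) - Pr(E1E2)."""
    return pr(e1) + pr(e2) - pr(e1 & e2)

pr_a_or_c = pr_union(A, C)   # 5/6, only a roll of 1 is left out
pr_b_or_c = pr_union(B, C)   # 5/6, and B and C share no outcomes

print(pr_a_or_c, pr_b_or_c, pr(B & C))
```

Because B and C never overlap, the subtracted term is 0 and the union is just the sum of the two probabilities.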

Class dismissed.

Polling 101: Probability

In our last post on Expected Values, we covered some of the basic tools a pollster might use to analyze a single random variable.  Before we move on to comparing 2 RV’s, we first need to cover one more preliminary concept.

Probability

Probability is a concept that’s easy to understand, but difficult to master.  It is a number between 0 and 1, assigned to an event, that indicates the likelihood of that event occurring.  An event is a collection of outcomes from a given RV.  The event of rolling an even number on one throw of a fair die includes the outcomes 2, 4, and 6.  An event with 0 probability means there’s no chance it will occur; an event with a probability of 1 means that it will occur with absolute certainty.

We calculate probabilities of events when we don’t have a procedure to predict the outcome with certainty (as we do for the trajectory of a rocket, for example).  Calculating them is a mix of art and science; a variety of mathematical rules are available to help you, but in the end, it’s a matter of analyzing all the available information.

One such rule is, if all outcomes are equally likely (i.e. a uniform distribution), the probability of an event is the number of outcomes that satisfy the event, divided by the total number of outcomes.  The probability of rolling a 5 on a die is 1/6 (0.167) since only 1 outcome yields this result and there are 6 possible outcomes.  The probability of throwing an even number is 3/6 (1/2 or 0.5) since there are 3 outcomes that yield this result.

Probability Functions

We denote the probability of event A as Pr(A).  We can also define a probability density function (PDF) for continuous RV’s, and a probability mass function (PMF) for discrete RV’s.  A PDF or PMF describes, in a single function, how probability is spread across all of the outcomes.  Many common functions occur frequently in nature.  One of the most common is the Gaussian or Normal function shown below.

courtesy of wikipedia.org


The Normal PDF forms the well known “bell-shaped curve”, with its mean value, µ, at the center.  We’ll return to the Normal function later, but for now will confine ourselves to discrete PMF’s.  For our die example, the PMF is:

courtesy of wikipedia.org


Here we see that the probability of each outcome is the same, 1/6.
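In code, a discrete PMF is little more than a lookup table.  The Python sketch below represents the fair-die PMF as a dictionary and adds up entries to get event probabilities; the names die_pmf and pr_event are illustrative choices of ours.

```python
from fractions import Fraction

# PMF of one fair die: each face maps to its probability.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def pr_event(pmf, event):
    """Probability of an event: add up the PMF over the outcomes in the event."""
    return sum(pmf[outcome] for outcome in event)

total = sum(die_pmf.values())            # a valid PMF always sums to 1
pr_even = pr_event(die_pmf, {2, 4, 6})   # 1/2
pr_five = pr_event(die_pmf, {5})         # 1/6

print(total, pr_even, pr_five)
```

The same pr_event function would work for a loaded die too; only the values in the dictionary would change.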