
Polling 101: Visualizing Correlation

We’ve seen that we can quantify how closely two random processes, or random variables, track each other by calculating the correlation between them. A correlation coefficient near +1 means the two random variables are positively correlated: they tend to rise and fall together. A coefficient near -1 means they are negatively correlated: one tends to fall as the other rises. A coefficient near 0 means the two random variables are uncorrelated and don’t track each other at all. In our hypothetical poll, we calculated the coefficient between two polling questions as -0.55, a relatively strong negative correlation.
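As a reminder of what that single number involves, here is a minimal Python sketch of the Pearson correlation coefficient. The function name `pearson` and the six sample responses are invented for illustration; they are not the poll data behind the -0.55 figure.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical responses, invented for illustration (not the poll data).
income = [1, 2, 3, 4, 5, 6]    # income bracket, 1 to 6
support = [3, 3, 2, 2, 1, 1]   # support level, 1 to 3

r = pearson(income, support)
print(round(r, 2))  # strongly negative for this made-up sample
```

Because support falls as income rises in this made-up sample, the coefficient comes out close to -1.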

Plotting Correlated Random Variables

While calculating a single number is a handy way to quickly assess the correlation between two random variables, it’s also helpful to plot the data on a graph and inspect the relationship between the two variables. A standard ‘scatter plot’ works fine for most statistical analyses, but we find that more sophisticated methods are required to visualize polling data.

Consider an investigation that studies the link between stress and high blood pressure.  If you were to interview 20 volunteers and measure the level of stress in their lives along with their average blood pressure, you could use the data to create the following plot.

[Figure: scatter plot of stress level versus blood pressure, with line of best fit]

Here, each point represents a single test subject: the x value is the subject’s level of stress and the y value is their blood pressure. Even without calculating the correlation coefficient, we can easily see that there’s a relatively strong correlation between the two. The plot also shows the line of best fit, calculated from the data points (we’ll show how to determine that line in a later post).

But what would we see if we drew a similar scatter plot of our fictitious polling data, plotting respondents’ support of Cap-and-Trade legislation against their income bracket?

[Figure: scatter plot of Cap-and-Trade support versus income bracket]

Plots created with gnuplot.

3D Surface Plots

The simple scatter plot above is not very informative. Because there are only 6×3=18 possible outcomes, many individual responses are simply stacked on top of one another at those 18 points. If we introduce a third axis, however, we can show how many respondents chose each possible outcome.
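The values for that third axis are just a count of how many respondents fall on each (income, support) pair. A minimal Python sketch, using a handful of responses invented for illustration rather than the actual poll data:

```python
from collections import Counter

# Hypothetical (income bracket, support level) pairs, invented for illustration.
responses = [(1, 3), (1, 3), (2, 3), (2, 2), (3, 2), (3, 2),
             (4, 2), (5, 1), (5, 1), (6, 1), (6, 1), (6, 2)]

# Identical pairs collapse onto one point in a scatter plot;
# counting them recovers the information that the plot hides.
counts = Counter(responses)

# 3 rows (support levels 1 to 3) by 6 columns (income brackets 1 to 6).
# Pairs that nobody chose get a count of 0.
grid = [[counts[(income, support)] for income in range(1, 7)]
        for support in range(1, 4)]

for support, row in zip(range(1, 4), grid):
    print(support, row)
```

Each row of `grid` holds the frequencies for one support level, which is the kind of table a surface plot needs as its z values.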

[Figure: 3D plot of response frequency by income bracket and Cap-and-Trade support]

That’s a bit better. Here we see the 18 possible results positioned by their x and y coordinates, with the number of respondents who selected each one shown on the z axis. If we look closely enough, we might see a ridge running from (Income=1, Cap-and-Trade=3) to (Income=6, Cap-and-Trade=1). This is the 3-dimensional equivalent of a cluster of points around a straight line, and it indicates correlation.

To further aid visual analysis, we can fill in the gaps between the given data points with approximate values, a process known as interpolation. We can then color-code the surface to further illustrate the relationship between the data points.
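One simple way to fill in those gaps is bilinear interpolation, sketched below in Python. This is an assumption chosen for illustration: the plots in this post were made with gnuplot, which may use a different interpolation scheme, and the frequency grid here is invented rather than taken from the poll.

```python
def bilinear(grid, x, y):
    """Interpolate grid (rows indexed by y, columns by x) at fractional (x, y).

    Coordinates are zero-based: x in [0, cols - 1], y in [0, rows - 1].
    """
    rows, cols = len(grid), len(grid[0])
    # Lower-left corner of the cell containing (x, y), clamped so that
    # the far edges of the grid reuse the last cell.
    x0 = min(int(x), cols - 2)
    y0 = min(int(y), rows - 2)
    tx, ty = x - x0, y - y0
    # Blend along x on the two bounding rows, then blend those along y.
    lower = grid[y0][x0] * (1 - tx) + grid[y0][x0 + 1] * tx
    upper = grid[y0 + 1][x0] * (1 - tx) + grid[y0 + 1][x0 + 1] * tx
    return lower * (1 - ty) + upper * ty

# Hypothetical frequency grid, invented for illustration:
# 3 support levels (rows) by 6 income brackets (columns).
grid = [
    [0, 0, 1, 2, 4, 5],
    [1, 2, 4, 4, 2, 1],
    [5, 4, 2, 1, 0, 0],
]

# Resample the surface on a grid with quarter-step spacing.
step = 0.25
fine = [[bilinear(grid, i * step, j * step) for i in range(21)]
        for j in range(9)]
```

The finer grid `fine` passes through the original counts at the whole-number coordinates and varies smoothly in between, which is what lets a color-coded surface show the ridge more clearly.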

[Figures: two interpolated surface plots of the same data]

The saddle shape of these plots shows the ridge that indicates a fairly strong correlation between the two sets of results. Ultimately, we can turn the right-hand surface plot into a 2-dimensional map and simply follow the colors.

[Figure: 2-dimensional color map of response frequencies]

So, the rule for interpreting map plots that compare two result sets is simple. If you can draw a straight line across the plot that touches only hot colors (i.e. those that represent high-frequency result pairs), you’ve identified a likely correlation. This technique, combined with the calculation of the correlation coefficient, gives you a good indication of whether the two questions are correlated.