Welcome!

Welcome

Welcome to Prolog, proloquor.net’s web log.  Here you can learn more about proloquor.net, the art and science of opinion polling, and anything else that comes to our minds.

The Hope of Deliberative Polling

When only two-thirds of Americans can name the Vice President of the United States or their own state’s Governor, it seems of little use to do any polling of public opinion at all.  With public opinion formed largely by 10-second sound bites and Daily Show skits, it seems opinions can be molded to reach any conclusion a media outlet wants to support.

To combat the public’s ignorance of today’s complex issues, and the faulty polling data it creates, Stanford’s James Fishkin developed a polling method he called Deliberative Polling.  In Deliberative Polling, respondents are first polled on a series of questions about the issues of the day, and are then asked to participate in a series of moderated discussions that explore the subject areas in a thorough but balanced fashion.  The respondents are polled again, and their answers before and after the discussion are compared.

Proponents of Deliberative Polling, such as psychologist and noted eldercare authority Ira Rosofsky, point out that a well-informed sample population can return drastically different responses than an uninformed one, particularly on such current topics as healthcare.  In fact, the effect of the deliberation can be measured by comparing the before and after polls.

There is, of course, a catch.  While it may be possible to conduct a thorough investigation of a subject before re-polling the respondents, it is quite difficult to be simultaneously thorough and balanced.  In fact, a good rule of thumb in the polling world is that the less you discuss the topic with the respondent, the more balanced the questions are likely to be.  Every word you add to the polling question (or setup) increases the chance that you’ll introduce a bias into the poll.  Furthermore, there is the difficulty of unringing the bell.  Once the question has been asked, you have planted a seed in the respondent’s mind that is difficult to remove, regardless of how much discussion you conduct afterwards.

So while Deliberative Polling shows promise, we won’t be doing any of it at Proloquor.net.  Instead, we plan to let our polls take their natural course, and will resist the siren’s song of Informing the Public.  Let’s hope the public is up to the task.

Rasmussen Takes Polling to the People

David Frum, Bush(43) speech writer turned conservative blogger at frumforum.com (formerly NewMajority.com), posted an interesting interview with pollster Scott Rasmussen about his latest venture in automated polling.  Apparently bowing to the pressure to introduce polls commissioned by third parties into Rasmussen Reports, Scott recently spun off Pulse Opinion Research, specializing in client-centric polling.

In the interview, Rasmussen announced that Pulse will introduce a service next month that allows anyone on the Net to commission a poll for as little as $600.  The polling mechanism will use Rasmussen’s standard automated telephone inquiry system.

While the system comes under routine fire by critics who question the fairness of automated queries, unedited questions from otherwise anonymous clients will likely only strengthen their arguments.  Still, it’s a bold step by one of the innovators of the polling industry.  If Pulse can provide more than the autodialer, with consultation services on questioning and statistical analysis of the results, it would be the first step into true open source polling.

Unringing The Bell

I ran across an interesting observation this week about how the UK online polling service yougov.com conducts ‘back-to-back’ polls, or successive polls asking basically the same question only a few days apart.  The danger of this is that the experience of answering the first question will influence the answer on the second.  There’s a certain Heisenberg Principle at work here; once you’ve observed a phenomenon, you’ve changed its behavior in some way.

This is an unfortunate side effect of panel-based polling.  The most likely scenario is that the poller has crafted a question, only to realize it’s biased or otherwise flawed after the poll was conducted.  If he wishes to correct his mistake and repeat the poll, he’ll likely produce similarly skewed results since the respondents will simply repeat their previous response, without appreciating the subtle difference in the new question.

Of course, back-to-back polls are only a problem if you ask the same group of respondents both questions, as is the case with an opt-in, panel-based polling system like yougov.com.  In a traditional ‘random’ sampling where you can be relatively certain of different sample sets, you’re OK.  The downside of traditional systems, however, is that you lose the ability to correlate responses among your sample set.

A Fresh Approach: Minekey.com

To say there are many online forums is a typical understatement of the Internet age.  By Google’s last  count, there are over 350,000,000.  In this Web 2.0 world, the mantra remains “power to the people” or, more realistically, “let your visitors provide your content”.  That’s why I’m always encouraged when I see people trying to do more with visitors’ content besides  simply posting it.

minekey.com is one example which attempts to provide some structure to the din of opinions expressed by its purported 3 million posts each day.  Integrated into its bulletin board, however, is a simple polling mechanism that allows subsequent visitors to register a simple agree/disagree vote on each opinion.  The polling results are tabulated and presented in real-time.  Results are available in whole, or categorized by gender and age, two factors you are asked to identify about yourself when you register.

To be sure, there’s a lot more they could do to understand the data they collect.  Since the polls are created entirely by their members, the questions are awkwardly worded, and often blatantly biased.  I’d also like to see results cross-referenced or correlated between different, but related questions.  It’s also apparent that Minekey needs to further promote the polling features of their service, since many more members leave comments on each post than votes.

Nevertheless, Minekey is a fresh look at one of the Internet’s basic services.  Check it out.

Polling 101: Visualizing Correlation

We’ve seen that we can quantify how two random processes, or random variables, synchronize with each other by calculating the correlation between them.  A correlation coefficient near +1 means the two random variables are positively synchronized or correlated.  A coefficient near -1 means they are negatively correlated.  Coefficients near 0 mean the two random variables are uncorrelated, and don’t track each other at all.  In our hypothetical poll, we calculated the coefficient between two polling questions as -0.55, a relatively strong negative correlation.

Plotting Correlated Random Variables

While calculating a single number to assess correlation is a handy method for quickly comparing two random variables, it’s also helpful to plot the data on a graph, and inspect the relationship between the two variables.  While a standard ‘scatter plot’ works fine for most statistical analyses, we find that more sophisticated methods are required to visualize polling data.

Consider an investigation that studies the link between stress and high blood pressure.  If you were to interview 20 volunteers and measure the level of stress in their lives along with their average blood pressure, you could use the data to create the following plot.

[Figure: scatter plot of stress level against blood pressure, with line of best fit (correlation)]

Here, each point represents a single test subject, with its x value representing their level of stress and y value representing their blood pressure.  Even without calculating the correlation coefficient, we can easily see that there’s a relatively strong correlation between the two.  The plot also shows the line of best fit, calculated from the data points (we’ll show how to determine that line in a later post).

But what would we see if we formed a similar scatter plot from our fictitious polling data, measuring respondents’ support of Cap-and-Trade legislation against income bracket?

[Figure: scatter plot of Cap-and-Trade support against income bracket (corr1)]

Plots created with gnuplot.

3D Surface Plots

The simple scatter plot above is not very informative.  Because we have only 6×3=18 possible outcomes, many individual outcomes were simply stacked up on one of these 18 possibilities.  If we introduced a 3rd axis however, we could evaluate the relative strength of each possible value.

[Figure: 3D plot of response frequency by income bracket and Cap-and-Trade support (corr2)]

That’s a bit better.  Here we see the 18 possible results as defined by their x and y coordinates, and the frequency of respondents who selected them by the z axis.  If we look closely enough, we might see a ridge running from (Income=1, Cap-and-Trade=3) to (Income=6, Cap-and-Trade=1).  This is the 3-dimensional equivalent of that cluster of points around a straight line and indicates correlation.
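For readers without gnuplot at hand, here is a rough text-only sketch of the same idea.  It assumes the counts from the fictitious cap-and-trade table in the Polling 101: Correlation post; the helper names text_bars and peak_answer are made up for this illustration.  Each of the 18 cells gets a bar whose length stands in for the z axis, and the tallest cell in each income row gives a crude trace of the ridge.

```python
# Counts copied from the fictitious cap-and-trade poll table.
# Rows: income brackets 1-6; columns: No (1), In Part (2), Yes (3).
counts = [
    [12, 10, 114],
    [61, 23, 95],
    [73, 54, 60],
    [82, 54, 12],
    [110, 52, 6],
    [119, 61, 2],
]

def text_bars(counts, scale=10):
    """Return one line per (income, answer) cell with a bar proportional to its count."""
    lines = []
    for income, row in enumerate(counts, start=1):
        for answer, n in enumerate(row, start=1):
            bar = "#" * (n // scale)  # one '#' per `scale` respondents
            lines.append(f"income={income} answer={answer} {n:4d} {bar}")
    return lines

def peak_answer(row):
    """Return the 1-based answer code with the highest count in one income row."""
    return row.index(max(row)) + 1

for line in text_bars(counts):
    print(line)

# The tallest cell in each income row is a crude stand-in for the ridge.
ridge = [peak_answer(row) for row in counts]
print("peak answer per income bracket:", ridge)
```

The peaks start at answer 3 for the lowest incomes and drop to answer 1 for the highest, which is the same downward slope the ridge shows in the plot.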

To further aid visual analysis, we must fill in the gaps, or interpolate, between the given data points with approximate values.  We can then color-code the surface to further illustrate the relationship between the data points.

[Figures: interpolated and color-coded surface plots of the same data, side by side (corr3, corr4)]

The saddle shape of these plots shows the ridge that indicates a fairly strong correlation between these 2 results.  Ultimately, we can turn the right-hand surface plot into a 2-dimensional map and simply follow the colors.

[Figure: 2-dimensional color map of the surface plot (corr5)]

So, the rule for interpreting map plots that compare 2 result sets is simple.  If you can draw a straight line across the plot by touching only hot colors (i.e. those that represent high-frequency result pairs), you’ve identified correlated events.  This technique, combined with the calculation of the correlation coefficient, will give you a good clue that the 2 questions are correlated.

Polling 101: Correlation

Correlation and dependence are two statistical concepts key to polling.  Understanding the difference between these two concepts is critical if you are to understand the results of your labor.

Covariance

We’ve already defined the variance of a random variable as E[(X – µx)²].  This value measures the spread of the values it takes on about the mean.  Now let’s define the Covariance between two random variables, X and Y, as:

Cov(X,Y) = E[(X – µx)(Y – µy)]

Equations by mathURL.com.

Before we begin using this equation on actual data, it requires closer inspection.  X and Y represent the result of an endless series of experiments, each resulting in a number.  These numbers, on average, are centered around their respective means.  Therefore, (X – µx) and (Y – µy) vary similarly around 0.

Consider an X and Y that track each other closely.  That is, when X takes on its largest value, Y typically takes on its largest value too.  When X is relatively small, Y is too, and so on.  Then (X – µx) and (Y – µy) also track each other but, regardless of the sign of X and Y, they range between positive and negative values about 0.  Consequently, (X – µx)(Y – µy) would generate large positive values, as the product of either 2 large positive numbers or 2 large negative numbers.  Now consider an X and Y that are completely “out of sync”.  That is, when X is at its highest value, Y is typically at its lowest, and so on.  Now (X – µx)(Y – µy) would be the product of a large positive value and a large negative value (or vice versa), resulting in a large negative value.  Finally, consider an X and Y whose relationship continuously shifts from perfectly in sync to perfectly out of sync.  (X – µx)(Y – µy) would sometimes be positive and sometimes be negative, so the expected value of (X – µx)(Y – µy) would tend to 0.

The covariance between two random variables then is a measure of how synchronized, or correlated, they are from sample to sample.  Large positive covariances mean X and Y are highly synchronized.  Large negative values mean X and Y are highly out of synchronization.  A covariance near 0 means X and Y are uncorrelated.

To aid in the calculation of covariances, some simple algebra can produce an alternate equation:

Cov(X,Y) = E[XY] – µxµy

This form is generally easier to calculate given sample data.
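As a quick sanity check, the sketch below computes the covariance both ways, from the definition E[(X – µx)(Y – µy)] and from the rearranged form E[XY] – µxµy, and confirms they agree.  The five paired observations are made up purely for illustration, and the sample is treated as if it were the whole population (dividing by n rather than n-1).

```python
def mean(values):
    """Plain arithmetic mean."""
    return sum(values) / len(values)

def cov_definition(xs, ys):
    """E[(X - mu_x)(Y - mu_y)], treating the sample as the whole population."""
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])

def cov_shortcut(xs, ys):
    """E[XY] - mu_x * mu_y, the rearranged form."""
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)

# Hypothetical paired observations, made up for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]

print(cov_definition(xs, ys), cov_shortcut(xs, ys))
```

The two results match up to floating-point rounding.  The second form needs only running totals of X, Y and XY, which is why it tends to be handier on tabulated poll data.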

Correlation

As you could imagine, there’s an upper limit to how correlated 2 random variables can be.  It can be shown that Cov(X,Y) always lies between -σxσy and +σxσy, where σxσy is the product of X and Y’s standard deviations.  Therefore, the covariance can be normalized to produce unitless values between -1 and +1, giving us a consistent measure of correlation among diverse sets of random variables.  This correlation coefficient is:

Corr(X,Y) = Cov(X,Y) / (σx σy)

An Example

To illustrate the use of correlation in data analysis, let’s look at the plausible (but totally fictitious) results of a poll conducted on 1000 Americans, asking them two simple questions:

  • What was your gross income last year? (X)
  • Do you support federal statutes implementing ‘Cap and Trade’ standards? (Y)

Do only rich people reject emissions regulations?  Calculating the correlation coefficient can give us a quantitative answer to the question.  Here’s the raw data:

Income            No (1)   In Part (2)   Yes (3)   Total
< $20k (1)            12            10       114     136
$20k-$50k (2)         61            23        95     179
$50k-$80k (3)         73            54        60     187
$80k-$120k (4)        82            54        12     148
$120k-$200k (5)      110            52         6     168
> $200k (6)          119            61         2     182
Total                457           254       289    1000

So, 12 respondents, or 1.2%, reported an income of < $20,000 and answered ‘No’ to cap-and-trade standards.  457, or 45.7%, in total thought cap-and-trade is a bad idea.

To start, we need to define our random variables, X and Y, and the numeric mapping we’ll use for them.  Let X be the answer to the income question, and Y be the answer to the cap-and-trade question.  As the table shows, we’ve assigned arbitrary numbers to each possible response to each question.  We can now make some basic calculations on the data.

With these values, we can now calculate the covariance and correlation coefficient between X and Y.

While it may not be clear from the covariance alone, the correlation coefficient, compared to its possible range of -1 to +1, shows us there’s a relatively strong negative correlation between these two poll results.  That is, the larger the answer given to the income question, the less likely the respondent is to support cap-and-trade legislation.  From our fictional survey, we have confirmed (to a degree) that rich people are less likely to support cap-and-trade than poor ones.
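The calculation can be reproduced from the table with a short script.  This is a minimal sketch, not necessarily the method used for the original figures: it assumes the 1-6 and 1-3 codes shown in the table, treats the 1000 respondents as the whole population, and uses Cov(X,Y) = E[XY] – µxµy and Corr(X,Y) = Cov(X,Y) / (σx σy).  The helper name expect is made up for this example.

```python
from math import sqrt

# Counts copied from the table above.
# Rows: income brackets 1-6 (X); columns: No (1), In Part (2), Yes (3) (Y).
counts = [
    [12, 10, 114],
    [61, 23, 95],
    [73, 54, 60],
    [82, 54, 12],
    [110, 52, 6],
    [119, 61, 2],
]

total = sum(sum(row) for row in counts)

def expect(f):
    """Expected value of f(x, y), weighting each cell by its share of respondents."""
    return sum(
        f(x, y) * n
        for x, row in enumerate(counts, start=1)
        for y, n in enumerate(row, start=1)
    ) / total

mu_x = expect(lambda x, y: x)
mu_y = expect(lambda x, y: y)
sigma_x = sqrt(expect(lambda x, y: x * x) - mu_x ** 2)
sigma_y = sqrt(expect(lambda x, y: y * y) - mu_y ** 2)
cov_xy = expect(lambda x, y: x * y) - mu_x * mu_y
corr_xy = cov_xy / (sigma_x * sigma_y)

print(f"mean X = {mu_x:.3f}, mean Y = {mu_y:.3f}")
print(f"sigma X = {sigma_x:.3f}, sigma Y = {sigma_y:.3f}")
print(f"Cov(X,Y) = {cov_xy:.3f}, Corr(X,Y) = {corr_xy:.3f}")
```

With these counts the covariance comes out at roughly -0.77 and the coefficient at roughly -0.54.  Keep in mind that the 1-6 and 1-3 codes are arbitrary, so a different numeric mapping would shift both numbers.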

Although we’ve taken a critical step in understanding our polling data with the use of correlation, there’s still much more to consider.  To complete our toolset, we must look at population sampling, hypothesis testing, and the difference between correlation and dependence.

Polling 101: Dependence

In our ongoing series on polling theory and statistical analysis, we pick up where we left off from our discussion of probability, and introduce 2 new concepts: conditional probability and independence.

Conditional Probability

We’ve already seen that probability is defined as a value between 0 and 1 that represents the relative likelihood that a particular event will occur.  If we define a random variable X to denote the roll of a single fair die, and the event A to denote the outcome of an even number, then:

Pr(A) = 3/6 = 1/2

formulas courtesy of mathURL.com

since there are 6 possible outcomes, 3 of which yield an even number.

Conditional probabilities allow us to express the probability of one event, assuming another event has occurred.  For example, if we write Pr(B|A), we are describing the probability that event B occurred, given event A has occurred.  Using our original event A above, if we define event B as rolling a 2, then:

Pr(B|A) = 1/3

since there are 3 possible outcomes (given the roll was an even number), 1 of which is a 2.

Finally, we can calculate the probability that both A and B will occur.  This is the intersection of the events A and B, and its probability is calculated as:

Pr(AB) = Pr(A) Pr(B|A)

So the probability of A and B is the probability of A, times the probability of B given A.  In our single die example, we can calculate the probability that the die is 2 and that the die is even as:

Pr(AB) = 1/2 × 1/3 = 1/6

This should make sense since there’s only 1 roll of the 6 possibilities that makes both events true.
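The single-die numbers are small enough to check by brute force.  The sketch below counts faces directly, with exact fractions; the helper pr and the lambda names are assumptions made for this example.

```python
from fractions import Fraction

outcomes = range(1, 7)  # faces of one fair die

def pr(event, given=None):
    """Probability of `event`, optionally conditioned on `given`, by counting faces."""
    space = [x for x in outcomes if given is None or given(x)]
    return Fraction(sum(1 for x in space if event(x)), len(space))

is_even = lambda x: x % 2 == 0  # event A
is_two = lambda x: x == 2       # event B

pr_a = pr(is_even)                               # Pr(A)
pr_b_given_a = pr(is_two, given=is_even)         # Pr(B|A)
pr_ab = pr(lambda x: is_even(x) and is_two(x))   # Pr(AB), counted directly

print(pr_a, pr_b_given_a, pr_ab)
print(pr_ab == pr_a * pr_b_given_a)
```

Counting the intersection directly gives the same value as multiplying Pr(A) by Pr(B|A), which is the product rule at work.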

Dependence

Sometimes, the probability of an event given another event is the same as the probability of that event alone.  When:

Pr(M|N) = Pr(M)

we say that M and N are independent.  Returning to the probability of 2 intersecting events:

Pr(AB) = Pr(A) Pr(B)

if and only if A and B are independent

This doesn’t hold for our original events A and B (Pr(B|A) = 1/3, Pr(B) = 1/6), but let’s define a second random variable Y which corresponds to the roll of a second fair die, with event M denoting that the second die is even, and event N denoting that the second die rolls a 2.  So the probability of rolling two 2’s is:

Pr(BN) = Pr(B) Pr(N) = 1/6 × 1/6 = 1/36

Here we used the reciprocity of the “if and only if” clause; in other words, the rule can be reversed if necessary.  Common sense tells us that the roll of the second die is in no way influenced by the roll of the first, and therefore they are independent of each other.  As a result, we can simply multiply the individual probabilities.
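The same check can be run over all 36 equally likely pairs of rolls.  This is a sketch under the assumption of two fair dice; the names below are made up for the example.  It tests whether Pr of the intersection equals the product of the individual probabilities, first for B and N (one event per die) and then for A and B (both on the first die).

```python
from fractions import Fraction
from itertools import product

# (first die, second die): 36 equally likely pairs
rolls = list(product(range(1, 7), repeat=2))

def pr(event):
    """Probability of `event` by counting the pairs that satisfy it."""
    return Fraction(sum(1 for r in rolls if event(r)), len(rolls))

first_is_even = lambda r: r[0] % 2 == 0  # event A
first_is_two = lambda r: r[0] == 2       # event B
second_is_two = lambda r: r[1] == 2      # event N

# B and N live on different dice.
pr_bn = pr(lambda r: first_is_two(r) and second_is_two(r))
independent_bn = pr_bn == pr(first_is_two) * pr(second_is_two)

# A and B live on the same die.
pr_ab = pr(lambda r: first_is_even(r) and first_is_two(r))
independent_ab = pr_ab == pr(first_is_even) * pr(first_is_two)

print(pr_bn, independent_bn)
print(pr_ab, independent_ab)
```

B and N pass the product test, while A and B fail it, since knowing the first die is even changes the odds that it is a 2.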

Event Unions

Since we’re talking about the intersection of 2 events, we should probably cover the union of them as well.  Where the intersection denotes events A and B occurring, union denotes A or B occurring.  To find the probability of a union of 2 events, the equation is:

Pr(A+B) = Pr(A) + Pr(B) - Pr(AB)

To illustrate this relationship, consider a Venn diagram of events A and B:

[Figure: Venn diagram of events A and B (union)]

If we simply added Pr(A) and Pr(B), we’d include the area Pr(AB) twice, so we need to subtract one out.  If we define event C as the value of the first die being greater than or equal to 3, we can calculate:

Pr(A+C) = Pr(A) + Pr(C) - Pr(AC) = 1/2 + 2/3 - 1/3 = 5/6

Of the 6 possible outcomes, only rolling a 1 fails to satisfy the event A+C.  Draw a Venn diagram to convince yourself.

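Finally, if Pr(AB) = 0, then Pr(A+B) = Pr(A) + Pr(B).  In this case, we call events A and B mutually exclusive.  Consider Pr(B+C):

Pr(B+C) = Pr(B) + Pr(C) = 1/6 + 2/3 = 5/6

A roll cannot be both a 2 and at least 3, so Pr(BC) = 0 and the two probabilities simply add.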
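Python sets map neatly onto events, so the union rule can be checked in a few lines.  A minimal sketch, assuming one fair die and the events A (even), B (a 2) and C (3 or more) as defined in this post; pr and pr_union are names invented for the example.

```python
from fractions import Fraction

faces = set(range(1, 7))                  # one fair die
A = {x for x in faces if x % 2 == 0}      # even roll
B = {2}                                   # roll is a 2
C = {x for x in faces if x >= 3}          # roll is 3 or more

def pr(event):
    """Probability of an event, given as a set of faces."""
    return Fraction(len(event), len(faces))

def pr_union(e1, e2):
    """Pr(E1+E2) = Pr(E1) + Pr(E2) - Pr(E1E2)."""
    return pr(e1) + pr(e2) - pr(e1 & e2)

pr_a_or_c = pr_union(A, C)
pr_b_or_c = pr_union(B, C)

print(pr_a_or_c, pr_a_or_c == pr(A | C))  # formula vs. counting the union directly
print(pr_b_or_c, pr(B & C))               # B and C share no faces
```

For B and C the intersection is empty, so the subtraction term is zero and the formula collapses to a plain sum, which is the mutually exclusive case.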

Class dismissed.

Wordsmithing Polling Data

Frank Luntz has been getting a lot of press lately, particularly on Fox, over his recently published book “What Americans Really Want…Really” (amazon.com).  Luntz is a conservative political pollster with a talent for identifying words and phrases that steer public opinion, usually over actual facts.  In the book, he conducts a fairly in-depth series of surveys, asking Americans how they relate to the following statements:

  • “I’m mad as hell, and I’m not going to take it anymore.”: 72%
  • “My kids will have a worse quality of life than I have.”: 57%
  • “Live free or die.”: 88%
  • “The 10 commandments are a good guide to live by.”: 89%
  • “I want it all, and I want it now.”: 35%

If you’re one of the few that look at polling data as a tool (really), you must be scratching your head at these numbers.  While conservatives have taken these data to underscore the momentum of townhall gatherings, liberals could make exactly the same claims to bolster the current administration’s plans.  If polling data is to be useful, it must look past rhetoric and phraseology and into the public’s will.

Understanding Trends

Here’s a report from Gallup that shows people’s perception of healthcare quality in the U.S. actually rose this year from last.  They admit, however, they’re stumped when explaining the shift.  What follow-on questions could they ask to shed some light on the cause of this trend?

More Confusion on Healthcare

Another story, this time by the Chicago Tribune, musing about contradictory polling data on healthcare reform.  Some polls suggest support for the public option, others show dissatisfaction with the bills that provide it.  Either the data is wrong, the questions are inconclusive, or the respondents are idiots.  The author, Eric Zorn, and Mark Blumenthal believe it’s the last of these.  It should be no surprise to these veterans that asking people (of any IQ) simple questions about incredibly complex topics leads to contradictory results.  It doesn’t make for the best media soundbites, but decomposing the problem into very small issues and building up a picture based on those ‘mini-responses’ may paint a clearer picture.

But who would want to be on the phone that long with a pollster?