Declared data vs inferred data: Which is more reliable?

[Cartoon omitted. Source: CartoonStock.com. Used by permission.]

Conventional thinking about how to discover what Internet audiences are really like holds that declared data is preferable to inferred data. That is, the registration pages and surveys that users fill in themselves should be more valuable than anything implied by their actual content engagement and online behavior. However, there are several reasons to doubt this, and I’m going to outline some of the biggest ones here.

Suppose that your latest website visitor just filled out your online survey, in which she claims the following:

  • income over $250,000
  • volunteers more than 100 hours per year
  • watches less than 2 hours of TV or streaming weekly

Do you believe all of that? What if her cookie trail shows that in the last month she has visited 73 pages across 12 fan sites for Doctor Who, Star Trek, X-Men, Agents of S.H.I.E.L.D. and Supernatural, where she has made 13 comments and up- or down-voted 43 others? Do you still believe the “less than 2 hours” of weekly media consumption? And if that isn’t trustworthy, what about everything else she indicated?

What you are looking at is a case where inferred data seems to paint the real picture of a user, versus the false impression we would get by relying too heavily on that user’s declared data. But how common is this situation?

Very common indeed.
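To make the scenario concrete, here is a minimal sketch of that kind of contradiction check. The field names, the engagement score, and both thresholds are my own illustrative assumptions, not anyone’s production logic:

```python
from dataclasses import dataclass

@dataclass
class Visitor:
    declared_weekly_media_hours: float  # what she told the survey
    fan_pages_last_month: int           # from the behavioral log
    comments_last_month: int            # ditto

def media_claim_is_suspect(v: Visitor,
                           heavy_engagement: int = 50,
                           low_hours: float = 2.0) -> bool:
    """Flag the declared figure when heavy fan-site activity (a crude
    proxy for heavy viewing) coexists with a very low declared number.
    Both thresholds are illustrative guesses, not calibrated values."""
    engagement = v.fan_pages_last_month + 2 * v.comments_last_month
    return (engagement >= heavy_engagement
            and v.declared_weekly_media_hours < low_hours)

visitor = Visitor(declared_weekly_media_hours=1.5,  # "less than 2 hours"
                  fan_pages_last_month=73,
                  comments_last_month=13)
print(media_claim_is_suspect(visitor))  # True: behavior contradicts the claim
```

A flag like this does not prove the survey answer is a lie, of course; it just tells us which declarations need corroborating before we build anything on them.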

The plague of pollsters

It is well known among professional pollsters and survey researchers that people often withhold the truth on surveys, even when they know their answers will remain anonymous. In fact, there is special terminology for different aspects of the phenomenon:

  • The “Bradley effect” refers to cases where voters do not reveal their intent to vote for or against a political candidate of a particular race. It was so named after Tom Bradley, an African-American candidate, lost the 1982 race for Governor of California despite polls giving him a big lead.
  • The “Shy Tory Factor” refers to conservative-leaning voters being unwilling to say that they will vote conservative. It was dubbed after John Major pulled off a surprise victory in the UK’s 1992 general election. Despite pollsters changing their methods to prevent the effect, it happened again in the UK’s 2015 election.
  • “Self-verification” is the distortion of one’s memories and perceptions about oneself to selectively include only what agrees with one’s existing self-image.
  • “Farming” is the practice of repeatedly and purposely lying on surveys for some tactical advantage one hopes to gain, for example:
    • a consumer may believe (perhaps rightly) that he or she will get more special offers, discounts, etc., but only if certain data is present in their user profile
    • a user may want to influence the decisions others will make based on the aggregate survey results, i.e., he or she sees biasing their profile as a way of “voting” for something

And here are some interesting statistics about lying in online profiles and surveys:

  • An ORB International survey that asked people how often they were truthful on surveys found that 20% did not intend to be “always truthful” in their responses. When asked specifically whether they would tell the truth about intimate matters, the share not intending to be truthful rose to 33%.
  • In a survey asking whether students had lied in their online profiles about personal data (gender, job, etc.), anywhere from 3% to 10% admitted posting false information, depending on the data point in question. And those are just the ones willing to admit to falsifying it.
  • Because churches tend to keep good attendance records, we know that consistent church attendance in the US is about 20%, yet online and telephone surveys regularly put it between 31% and 36%, meaning a lot of people greatly exaggerate.

Examples can be multiplied. My favorite is the survey that yielded a whopping 18% usage rate among teens for a fictitious drug that researchers added to the list just to see how many teens would claim to use a non-existent substance. If 18% marked “Yes” to a made-up drug, how many more might have lied on some other part of the questionnaire?

“Declared vs inferred” is a false dichotomy anyway

“But we only want to use declared data,” one might say. Yet no matter how hard we try, there are several ways that handling declared data drags us into making plenty of inferences before we can actually use it, situations where we simply cannot take the declared data at face value (see the sketch after this list):

  • Same data point, different declaration: When the same user fills out two registrations or surveys, once checking “White” for ethnicity and the other time checking “Asian”, should we infer that the individual is both? Or that one response is a mistake (or a lie)? Both responses are declared data, but we still have to infer how to process the combination.
  • Declarations that undermine each other: Suppose a registration says a person’s job title is CTO of a company with 2,000+ employees, but the same user marked “less than $60,000 annually” for income bracket. Did she accidentally check the wrong income box? Does she work at a startup for next to nothing and lie about the company size? Again, this is declared data, but it demands inference before we can properly use it.
  • Formal vs informal declarations: Let us not forget the wealth of “indirectly declared” data, by which I mean declarations users make about themselves outside of any survey or questionnaire. For example, suppose a person says informally, in a social media comment, “I bought one of these for my 9-year old last Christmas and he loves it.” From this I can infer that the commenter is the parent of a grade-school boy, but I would say she pretty much “declared” it, didn’t she? And if her official self-declared profile says “single, no kids”, should I keep believing that over her actual comment in a discussion thread?
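Here is a toy sketch of the first case: collapsing repeated declarations of the same field, and flagging disagreements for a later inference step (behavioral data, cross-field checks) instead of trusting either value blindly. The record layout, field names, and values are all hypothetical:

```python
from collections import defaultdict
from datetime import date

# Hypothetical declarations: (field, value, date declared).
declarations = [
    ("ethnicity", "White", date(2023, 1, 10)),
    ("ethnicity", "Asian", date(2024, 6, 2)),
    ("job_title", "CTO",   date(2024, 6, 2)),
]

def reconcile(decls):
    """Collapse repeated declarations per field.

    Consistent answers pass through; conflicting answers are kept as a
    set and flagged so that a later inference step can resolve them (a
    real system might instead prefer the most recent declaration)."""
    by_field = defaultdict(set)
    for field, value, _when in decls:
        by_field[field].add(value)
    resolved, conflicts = {}, {}
    for field, values in by_field.items():
        if len(values) == 1:
            resolved[field] = next(iter(values))
        else:
            conflicts[field] = sorted(values)
    return resolved, conflicts

resolved, conflicts = reconcile(declarations)
print(resolved)   # {'job_title': 'CTO'}
print(conflicts)  # {'ethnicity': ['Asian', 'White']} -> needs inference
```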

The more we dig into the combination and overlap of declared and inferred data, the more we see that managing inferred data is not just a way of extending the value of declared data; it is the only way to handle declared data properly.

To give one illustration: pollsters have made some progress combating the Shy Tory Factor described above, and do you know how they are doing it? Inferred data. For example, they check how people voted in the last election, infer that they will tend to vote the same way again, and weigh that tendency against voters’ declarations about the upcoming vote.
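A minimal sketch of that kind of correction, assuming a simple convex blend of the two signals; the 60/40 weighting is an arbitrary placeholder, not a polling-industry constant:

```python
def blended_vote_probability(declared: float,
                             past_behavior: float,
                             trust_in_declaration: float = 0.6) -> float:
    """Blend declared intent with the tendency inferred from past votes.

    Both inputs are probabilities of voting for a given party; the
    default 60/40 split is an illustrative assumption."""
    return (trust_in_declaration * declared
            + (1 - trust_in_declaration) * past_behavior)

# A "shy" respondent: declares a 20% chance, but voted that way last
# time, which we take (as an assumption) to imply a 90% tendency.
print(blended_vote_probability(declared=0.2, past_behavior=0.9))  # 0.48
```

Real polling models are far more elaborate, of course, but the shape is the same: the declaration gets weighed against the inferred tendency rather than taken at face value.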

This is likely to be the wave of the future: actual behavior and engagement data will be seen as a necessary correction to declared data. It’s as if the people en masse are effectively telling us, to borrow the immortal words of Attorney General John Mitchell, “Watch what we do, not what we say.”

-Tim Musgrove
@tmusgrove
