This week Chase Bank sent an email to its customers saying that one of its vendors’ computer systems had been hacked. The bank stated that they:
…are confident that the information that was retrieved [i.e., stolen] included some Chase customer e-mail addresses, but did not include any customer account or financial information. Based on everything we know, your accounts and financial information remain secure.
Confidence based on whatever they happen to know, eh? Because Chase could easily be mistaken, customers would be foolish to put their full trust in the bank’s assurances. I definitely plan to keep an eye on my Chase account for the next several months.
This same caution also applies to the most recent OCLC membership report, Perceptions of Libraries, 2010: Context and Community. The report’s energetic graphics and narrative make the information seem to be true. But, as my prior posts1 explain, surveys are always incomplete and imperfect. Findings from a single survey like OCLC’s are just not weighty enough to deserve our unconditional trust. At best, the findings are well-calculated estimates; at worst, they can be really bad guesses. In a word (the U word, actually), some level of uncertainty is always embedded in the information.
As with the Chase Bank situation, we have to wonder what data the OCLC researchers don’t have. Specifically, what kinds of people didn’t respond to the survey? How likely were over-surveyed college students to respond? Which citizens relying solely on public library computers would spend their online time answering the survey? Which sorts of people won’t be found among the few million Harris Interactive survey panel members?2 And so on.
The OCLC study and the Chase data theft have something else in common. In both cases the institutions are selective about which information they pay attention to. In the jargon of scientific research this is called confirmation bias—intentionally or unintentionally considering only evidence that supports one’s preconceived notions and ignoring evidence to the contrary. Scientists and engineers strive to avoid this bias since it can be so dangerous.
Here’s a creative interpretation of data which otherwise don’t support the OCLC researchers’ viewpoint:
Millions of Americans, across all age groups, indicated that the value of the library has increased during the recent recession.3
Note in the top bar chart that, among the different respondent age groups (labeled on the horizontal axis), from 16% to 36% perceived the library as more valuable. The researchers call these “double-digit percentages” to suggest that they are large. But their relative smallness becomes obvious when we look at the rest of the data. The lower bar chart (with brown-beige bars) shows that from 64% to 84% did not perceive libraries as more valuable. Each age group has a roughly 2 to 1 or higher majority reporting no increased library value. However, in the OCLC study the voices of these majorities fell upon deaf ears.
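The arithmetic here is simple: the lower chart’s bars are just the complements of the upper chart’s. A minimal sketch makes the point; only the 16% and 36% endpoints come from the text, and the intermediate values (and the age labels) are illustrative guesses at what the chart shows:

```python
# Approximate "library is more valuable" percentages by age group.
# Only the 16% and 36% endpoints come from the post; the rest,
# and the age labels themselves, are illustrative assumptions.
more_valuable = {"16-25": 36, "26-45": 30, "46-64": 22, "65+": 16}

for group, pct in more_valuable.items():
    complement = 100 - pct       # share NOT perceiving more value
    ratio = complement / pct     # majority-to-minority ratio
    print(f"{group}: {pct}% yes, {complement}% no, about {ratio:.1f} to 1")
```

Whatever the exact chart values, each “double-digit percentage” is dwarfed by its own complement.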
The minority also won in the chart shown here:
Source: Perceptions of Libraries, 2010, OCLC, Inc., p. 27.
The chart’s caption is untrue. Only 37% of survey respondents reported increased library use, and the OCLC study doesn’t say how much this use increased.
Eagerness to prove a point occasionally leads the OCLC researchers to misinterpret the data they do pay attention to. The study includes a wheel-shaped chart, shown here, that I also discussed in my prior post:
OCLC Circular Chart Comparing Respondent Group Library Use
Source: Perceptions of Libraries, 2010, p. 24.
The chart gives percentages of economically impacted respondents who use one or another library service monthly, compared to non-impacted respondents.4 Although the caption reads, “Economically impacted Americans use library services more frequently,” the data in the chart aren’t measures of more versus less frequent use. Instead, they indicate what the groups do with equal frequency, that is, monthly.
Based on data of the sort that’s in the chart, we might observe that one group has a higher proportion of its members represented in the entire group of monthly users than another group. But this would not mean that members of that group use library services more frequently than the other group does, nor that one group is more prominent among all monthly users than the other group.5 Neither would this sort of data indicate whether one group uses library services more frequently than they did in the past.
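A small numeric sketch shows why these readings don’t follow. All of the figures below are hypothetical, invented only to illustrate the logic:

```python
# Hypothetical group sizes and monthly-use shares (not OCLC data).
# The chart reports, within each group, the share using a service monthly.
impacted     = {"size": 400, "monthly_share": 0.42}
non_impacted = {"size": 900, "monthly_share": 0.30}

for name, g in (("impacted", impacted), ("non-impacted", non_impacted)):
    monthly_users = g["size"] * g["monthly_share"]
    print(f"{name}: {g['monthly_share']:.0%} of group -> "
          f"{round(monthly_users)} monthly users")
```

In this made-up case the impacted group’s higher within-group share (42% vs. 30%) coexists with a smaller headcount among all monthly users (168 vs. 270), and neither number says anything about how often within a month anyone actually visits.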
Most readers of the study won’t recognize the one giant leap of faith the OCLC researchers make in order to confirm their preconceptions. They use the word Americans repeatedly, implying that they are reporting definitive information about the U.S. population at large. However, the study’s portrayal of Americans isn’t definitive and the researchers know this. The Americans portrayed in the survey are merely individuals in the U.S. who participate in Harris Interactive’s survey panels. Statistically, findings from the sample of 1,334 Americans who were polled6 can describe only (statisticians say can be generalized to) this larger group of panel participants. Beyond that, the larger group is believed to be reasonably representative of all online users in the U.S.
But, here’s the glitch. In the last few pages of the report the researchers make this remarkable disclosure:
The online [American] population may or may not represent the general [American] population.7
Well, that pretty much covers the options, but what does it mean? It means the researchers have no idea how well the study figures resemble figures that are true for the entire U.S. population. Although statements like the following pervade the OCLC report…
A third of American families had at least one family member who experienced a negative job impact during the recession.8
…in the fine print we learn that the researchers aren’t very confident about these. (We would say they are 50% confident about the statements if we assume may or may not indicates two equal probabilities.) Thus, for the statement above the true amount could just as easily be a half of American families, three-eighths, a fourth, a fifth, one tenth, or something else. This uncertainty pertains also to extrapolations the researchers make from survey percentages to U.S. population counts, like:
13 million economically impacted Americans—that is more than the populations of New York, Chicago and Houston combined—are using the library more during the challenging economic time.9
It’s hard to know in which direction and how much this estimate might need to be adjusted to accurately reflect the U.S. population. Maybe we need to remove Houston from the count, or maybe we should add Philadelphia.
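To see how quickly a count like that can drift, here is a rough sketch. Only the +/- 2.7% margin of error comes from the report (see note 10); the survey share and base population below are hypothetical numbers chosen merely to land near the 13 million figure:

```python
# All figures hypothetical except the +/- 2.7% margin of error,
# which is the within-panel margin reported by OCLC (see note 10).
margin = 0.027
survey_share = 0.08             # hypothetical share of respondents
base_population = 163_000_000   # hypothetical extrapolation base

estimate = survey_share * base_population
low = (survey_share - margin) * base_population
high = (survey_share + margin) * base_population
print(f"estimate {estimate/1e6:.0f}M, plausible range "
      f"{low/1e6:.1f}M to {high/1e6:.1f}M")
```

Under these made-up numbers, even the within-panel margin swings the count by over 4 million in either direction, more than a Houston’s worth of people, and that range says nothing about how far the panel itself sits from the general population.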
Does the sketchiness (uncertainty) in the OCLC findings mean the study itself is somehow defective? Not at all. All surveys, and all measurements for that matter, are sketchy to one degree or another. We just need to be intelligent about this sketchiness.
If the data are too inexact for our purposes, we need to improve our data collection methods and then re-measure. If the data are good enough, we are obliged to use them conscientiously, explaining their limitations (uncertainty) loudly and clearly so we don’t lead our audience astray.10 If people become overconfident about our study conclusions, they could end up making the wrong decisions, like choosing to blissfully ignore their bank statements.
1 See Discussing Accuracy, Checking It Twice, Stranger Than Fiction, and Objects In Mirror Are Closer Than They Appear.
2 Statisticians at Harris Interactive understand questions like these and, in some cases, apply statistical adjustments meant to address them. But these adjustments are not guaranteed to work (see Discussing Accuracy).
3 Online Computer Library Center, (2010). Perceptions of Libraries, 2010: Context and Community, Dublin, OH: Online Computer Library Center, p. 44.
4 The report defines economically impacted respondents on p. 20. Non-economically impacted respondents are those not meeting the criteria outlined in that definition.
5 Replacing the words “more frequently” with “at a higher rate” makes this easier to comprehend. Also, don’t confuse the idea of which group uses a service more (group A uses more services than group B) with which group uses a service more frequently (most group A members use the service twice a month while most group B members use it monthly). The former has to do with total number of services used, the latter with the rate of use.
6 Online Computer Library Center, (2010). p. 102. To arrive at this figure you have to total the left column in the table entitled Total U.S. Respondents.
7 Online Computer Library Center, (2010). p. 102. Red emphasis added. In case you didn’t figure it out, may or may not is the epitome of uncertainty!
8 Online Computer Library Center, (2010). p. 6.
9 Online Computer Library Center, (2010). p. 26.
10 The OCLC researchers do provide some information about the limitations of their data in the form of a margin of error estimate of +/- 2.7%. Add or subtract the margin to/from the survey figures and you get a good (statisticians say plausible) guess about where the true values from the population probably lie. By reporting this margin, the researchers are announcing they are very (95%) confident that the true figures from the larger population fall within this range. However, since the sample was drawn from Harris Interactive’s participant population, the +/- 2.7% range only applies to that population. Unfortunately, this margin of error doesn’t provide us with a plausible range of values for the U.S. population at large. For an entertaining primer on margin of error see Stranger Than Fiction.
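For what it’s worth, the +/- 2.7% figure matches what the textbook formula for a 95% margin of error gives for a simple random sample of n = 1,334 at the worst-case proportion of 50%. This is a sketch of the standard calculation, not necessarily the exact method OCLC used:

```python
import math

n = 1334   # total respondents (report, p. 102)
p = 0.5    # worst-case proportion; maximizes the margin
z = 1.96   # z-score for 95% confidence

margin = z * math.sqrt(p * (1 - p) / n)
print(f"margin of error: +/- {margin:.1%}")  # prints +/- 2.7%
```

Note that the formula assumes a simple random sample, which an opt-in online panel is not, another reason the range only describes the panel population.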