Parachuting Cats

I apologize for such a delay since my last post. Other things have been occupying my time, obviously. With the Public Library Association’s (PLA) recent launch of its national survey effort, Project Outcome, I thought I ought to take the time to revisit a topic from an earlier post. To reintroduce this topic I begin with an excerpt from Gary Smith’s 2014 book, Standard Deviations: Flawed Assumptions, Tortured Data, and Other Ways to Lie with Statistics. (Can you guess where this is headed?) Smith writes:

A study of 115 cats that had been brought to a New York City veterinary hospital after falling from high-rise apartment buildings found that 5 percent of the cats that fell 9 or more stories died, while 10 percent of the cats that fell from lower heights died. The doctors speculated that this was because cats falling from higher heights are able to spread out their bodies and create a parachute effect. What’s an alternative explanation?1

Smith offers this explanation: the study reached a false conclusion because the sample excluded cats that died immediately after their falls, as well as injured cats whose caretakers chose, for whatever reason, not to seek veterinary care. Without data describing all cats that fall from high-rise apartments, we can't really know how survival after higher falls compares with survival after lower falls. No matter how intriguing the patterns in the data seem (including the possibility of parachuting cats), they can't be taken seriously because the data are insufficient for understanding the situation completely.

Libraries using the Project Outcome survey are in the same boat as the veterinary hospital researchers. Patterns seen in their survey data are (with apologies to the cats) fishy because the data are incomplete and biased. Clearly, the project team recognized this shortcoming and announced that the survey findings are “community snapshots” and not “research-focused” as indicated in this slide from the Project Outcome webinar held last month:

[Slide from the September 10, 2015 Project Outcome webinar.]

The webinar presenters didn’t explain the difference between these two data collection approaches. The difference is this: community snapshots are anecdotes, in this case aggregated from a short 6-item questionnaire. Research, on the other hand, involves ensuring that the data fairly and completely reflect the situation under study.

In case you think that evidence from systematically conducted research would easily trump anecdotal information, think again. To illustrate the appeal that anecdotes have, I offer this account from Howard Wainer’s 2011 book, Uneducated Guesses: Using Evidence to Uncover Misguided Educational Policies:

I recognize that the decision to base action on evidence is a tough one. In the last year we have seen that people armed with anecdotes, but not data, attacked evidence-based advice about who should have mammograms. Indeed many of the anecdotes relied on emotional counterfactual statements for their validity: “If my sister had not had a mammogram, she would not be alive today” or “If she had a mammogram she would be alive today.” The fact that such logically weak argumentation yielded almost immediate equivocation from government officials attests to how difficult it is to use the logical weight of evidence over the emotional power of anecdotes. We must always remember that the plural of anecdote is not data.2

Our own profession’s principles of information literacy favor Wainer’s position: the more completely our information search covers the relevant aspects of a topic, the higher the quality of the information. The goal is information accuracy, thoroughness, and trustworthiness.

As information, Project Outcome community snapshots have quality problems, so much so that these problems were the focus of the webinar, where they were labeled as “challenges.” I have to say, though, how gratifying it is to see the presenters discussing the limitations of data, a topic traditionally glossed over in national library studies. Usually these studies promote their findings with barely a mention of how approximate the data are or the chances, large or small, that the data could be wrong. Project Outcome is definitely on the right track on this. This much attention given to honestly examining possible weaknesses in data is a sign that the public library community has turned a methodological corner. However, there’s still a ways to go because the project veered off track with some of its advice for dealing with these weaknesses.

Let’s look first at the weaknesses they identified. In the webinar the project team’s main concern was survey bias, that is, slanted findings. In survey research, whether and how much bias affects survey findings depends on:

1. How complete a range of subjects is polled (selection bias)
2. The extent to which subjects who are polled cooperate by completing surveys (non-response bias)
3. Whether the measurement instrument affects responses, for example, leading questions (question bias)
4. Whether the survey administration affects responses, for example, hints encouraging desirable responses (administration bias)
5. Whether subject responses are accurate and truthful (response bias)
6. Whether the construction of the questionnaire as a whole yields generally biased results, e.g., cultural bias, gender bias, age bias, and so on (item bias)

The Project Outcome webinar classified challenges due to biased survey data into 3 categories:

1. Not enough responses
2. Responses too positive
3. Responses too negative

The not-enough-responses problem could be due to the first two of the 6 causes of bias listed above. The other two problems, too-positive and too-negative responses, could be due to a combination of any of the 6 causes. The project team didn’t say exactly why a small number of responses concerned them. I presume the worry was that this lessens the credibility of the findings. And they seemed to imply that selection and non-response bias were to blame.

The webinar presenter described a pilot-test library receiving only 86 completed questionnaires from 3,226 attendees at the library’s story time programs during the survey period, out of 137,379 attendees annually. The relevant population is the latter figure, the annual count. As the webinar participants noted, this figure is a duplicated count. So let’s just estimate an unduplicated count of 10,000 attendees annually. With this estimate, 86 is less than 1% of 10,000, too small to conclude that the survey findings accurately reflect the larger population. If we choose a lower estimate, say 5,000, then the 86 responses constitute about 1.7%. Either way, we then have to determine what percentage is sufficient.

However, this is a moot point for the Project Outcome team, which maintains that there is no such thing as an insufficient sample size:

We don’t recommend that you have a minimum sample size for Project Outcome. But in this case it may be desirable to try to increase the survey response rate next time around. But it doesn’t mean you can’t use the survey results that you’ve received. These 86 responses still have meaning…While surveys from a small number of respondents may not fully represent the truth about the larger population in your community, there’s no minimum number of patrons that you’re required to survey.3

The problem is that the meaning attributed to the 86 responses can easily be wrong, like cats parachuting through the air. The sample data are just too incomplete to draw meaningful conclusions from. Incidentally, a small sample size is not necessarily a handicap if a survey uses representative sampling. It is representativeness, not size, which gives samples their informational strength.
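To put rough numbers on the size-versus-representativeness point, here is a minimal sketch (in Python) of the standard margin-of-error calculation for a sample proportion. It assumes a simple random sample and an illustrative 80% agreement rate; neither assumption comes from Project Outcome’s actual data.

```python
import math

def margin_of_error(p_hat: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a sample proportion,
    assuming a simple random (i.e., representative) sample."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Illustrative numbers only: suppose 80% of 86 randomly sampled
# story time attendees agreed with an outcome statement.
n = 86
p_hat = 0.80
moe = margin_of_error(p_hat, n)
print(f"Estimate: {p_hat:.0%} ± {moe:.1%}")  # roughly 80% ± 8.5%
```

Under that generous assumption, 86 random responses would pin the true agreement rate down to within about nine percentage points, which is not bad at all. The catch is that the assumption does all the work: a self-selected batch of 86 responses, or of 8,600 for that matter, carries no such guarantee, because bias doesn’t shrink as the sample grows.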

The Project Outcome team is, I guess you could say, agnostic about conducting surveys using representative sampling methods. They don’t seem to talk about this option. The fact that they consider the tiniest sample, collected in whatever manner, to be acceptable is consistent with the idea that the project’s survey findings are essentially anecdotal. Scope, accuracy, and balance are not important with anecdotal information. The important thing is how well the information supports the intended message, as the webinar presenter attested:

The results still tell a powerful story. The survey results give quantitative data to support what most children’s librarians hear every day. Librarians will testify that the basic message of these numbers is not exaggerated. The vast majority of parents who attend story time know that they are immensely beneficial…Many of the library’s stakeholders—mayors, county commissioners, and city council members—are more likely to be moved by data than by the testimony of a librarian. And that’s the goal. To help funders and other stakeholders to see clearly the differences that our programs and services are making in people’s lives. That’s what Project Outcome is about.4

As you might expect, the differences are portrayed as undoubtedly positive and substantial. Except the survey doesn’t confirm that the positive and substantial differences really exist. To resolve this deficit the project team simply labels the findings community snapshots. Thus, they are counting on biased quantitative data carrying more weight than biased librarians! And also on survey report audiences not being familiar with studies like the veterinary hospital study with its parachuting cats.

While library stakeholders are apt to be under-informed about selection and non-response bias, they may not be so when it comes to the second methodological challenge the Project Outcome team identified, too-positive responses. Interestingly, the team takes these findings at face value, believing that ultra-successful results mean public libraries are highly effective and beneficial. The problem for libraries then becomes managing expectations. “When the results are that high, there is little room for improvement,” in the words of the webinar presenter.5

On the other hand, astute stakeholders may interpret the results differently, suspecting that the survey sets too low a bar for library performance. They might find the questionnaire’s agree-disagree items to be too easy, items like “You are more aware of issues in your community,” “You learned something new that is helpful,” and “You intend to apply what you learned.” Such general questions would surely elicit high levels of agreement, along the same lines as this item I made up: “The presenter said something interesting.”

Or stakeholders might see the survey as a too-lenient assessment because it doesn’t directly measure patron improvements in reading proficiency, digital resource use, civic engagement, job search skills, and so on. Or the fact that responses to the agree-disagree items range from no change (neutral) to positive change might also bother stakeholders. Outcomes for patrons who finished library programs more confused, less motivated, less confident, or in any way disadvantaged would end up in some neutral classification like no reported changes/benefits. So, libraries won’t have to worry about scores being pulled down by negative outcomes. Nor will they need to concern themselves with scores shrinking due to disappointed attendees dropping out of programs early (see attrition). To prevent duplicate surveys, the project team recommends that libraries poll participants only at the end of programs. Except this is when drop-outs will already be gone, the same way that some falling cats were absent from the New York City veterinary hospital study.

Library survey subjects differ from falling cats in one important respect. As far as I can tell, falling cats don’t realize they are being studied, whereas patrons do (see reactivity). This invites additional biases, from socially desirable answers and from the halo effect, where positive associations with libraries, say from childhood, color responses. Not that libraries shouldn’t take advantage of patron good will. But measures of the effectiveness of a training class shouldn’t get mixed up with measures of patrons’ fondness for the library.

Considering all of these sources of bias, it’s difficult to envision the Project Outcome results being anything but positively slanted, an outcome the project team must have anticipated.

I realize that the PLA project has limited interest in formal survey research and questionnaire construction methods. But the highly positive response patterns they’ve observed suggest another dynamic at work here. When developing a measurement instrument, say a questionnaire to measure community awareness of library services, items in the questionnaire need to pass certain statistical tests up front, prior to full implementation in the field. One requirement is that the items show some minimal level of variability. If an item elicits uniform responses across a range of otherwise diverse subjects, that item is suspect. Chances are the item doesn’t measure anything real, or at least useful. This is because a primary function of survey instruments is distinguishing between respondents on dimensions relevant to the study at hand.

Consider psychometric and educational achievement testing as examples. If extroverted and introverted respondents to a psychometric test all answer a given item the same way, with no deviation (variation), then that item is probably irrelevant to the test and should be removed. If every educational test-taker answers a given test item correctly, then that item isn’t measuring the right thing. This applies to any questionnaire intended for gathering reasonably valid measures about something in the real world. Maybe the project team should consider looking more closely at the variability of their survey items.
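For what it’s worth, the variability check described above doesn’t require anything elaborate. The sketch below, with made-up item names and made-up 1-to-5 responses, simply flags items whose responses barely vary; the 0.5 cutoff is arbitrary and is only there for illustration.

```python
import statistics

# Hypothetical 1-5 Likert responses for three survey items.
# Item names and numbers are invented for this example.
responses = {
    "learned_something_new": [5, 5, 5, 5, 4, 5, 5, 5],
    "more_aware_of_issues":  [5, 4, 5, 5, 5, 5, 4, 5],
    "intend_to_apply":       [3, 5, 2, 4, 5, 3, 4, 2],
}

MIN_SD = 0.5  # arbitrary cutoff for this sketch

for item, scores in responses.items():
    sd = statistics.stdev(scores)
    note = "low variability; item may not discriminate" if sd < MIN_SD else "ok"
    print(f"{item}: sd = {sd:.2f} ({note})")
```

An item flagged this way isn’t necessarily worthless, but it isn’t telling you much about differences among respondents, which is the job a measurement instrument is supposed to do.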

Then again, uniform responses go perfectly with the project team’s goal of proclaiming the immense benefits that library patrons receive. As I’ve written elsewhere, it takes a lot more than 6 survey questions casually administered to determine whether programs actually produce benefits. Effective outcome evaluation measures how much betterment has occurred in program recipients, what kinds of betterment were observed, how many participants these occurred for, specific characteristics of recipients for whom various levels of betterment did or did not occur, whether outcomes can be causally linked to program services delivered, and according to which stakeholder perspectives the observed outcomes were seen as valuable. I’m not saying Project Outcome’s mission should be to tackle this difficult form of evaluation. I am saying that the project team should not be making claims of assured impact and immense benefits based on such sketchy outcome data.

Then there’s Project Outcome’s aggregation of survey data as published national averages for each core service area. The averages are mislabeled because they don’t represent U.S. public libraries as a whole. (Maybe the label Project Outcome averages would work instead.) The averages summarize only 4 of the 6 survey items, the agree-disagree ones. Also, participating libraries don’t necessarily submit data for all 7 service areas. (At this stage I’d expect most libraries to submit data for 1 or 2 areas.) Thus, the scope of the collective data is narrow right now. I expect that the PLA hopes the eventual pool of participating libraries will surpass subscriptions to its Public Library Data Service.

However large the project’s data pool becomes, comparisons with its average scores aren’t that informative. Due to the positive slant of the data the averages will tend to be clustered together and differences between most libraries will be quite small. Plus, libraries making comparisons need to be mindful of characteristics of the other libraries in the pool. And also of differences in local programming. Other than perhaps story time and summer reading, it’s difficult to know what mix of programs a library is being compared to in a given core service area. Therefore, many of the comparisons will be between apples and oranges. The most useful comparisons are those made among similar libraries and programs or what David Ammons calls respected counterparts (see his book, Municipal Benchmarks). These provide a more meaningful context for a library’s scores than the project’s national averages will.

Regardless of benchmarking best practices, there’s a drawback to Project Outcome scoring that keeps the data from being what they appear to be. Meaning they fall squarely within the realm of parachuting cats! Because the scores are ordinal data (as all Likert scales are), dividing them by respondent and library counts (which averaging does) produces quantitatively ambiguous results. A 0.5 difference between a 3.5 score and a 3.0 score is not necessarily equal to the 0.5 difference between a 4.5 score and a 4.0 score. Seemingly precise differences (0.1, 0.2, 0.5, and so on) are not precise at all because each combination of outcome and core service area has its own unique scaling. Therefore, when these numbers are added, subtracted, multiplied, or divided, the results are ambiguous and fuzzy, regardless of how exact the digits appear. A figure of, say, 0.5 seen in one calculation won’t be arithmetically equal to a 0.5 from another calculation, due to the nature of ordinal numbers. Weird, I know. But true.

In case the workings of ordinal scales seem too mysterious, let me use an everyday example, the 1-to-10 pain scale used in the healthcare field. This is also an ordinal scale. If you’ve ever been asked this question about your level of perceived pain, you might realize the 10 digits are pretty artificial. If you respond with 7, are you sure the pain you are experiencing is 3 pain-points higher than 4? And exactly 3 pain-points less than 10? I doubt it. You are merely expressing higher versus lower based on a familiar numbering scale. The 10 digits are not necessarily evenly spaced. That’s the problem with ordinal data. We’re completely blind to the actual spacing between measurement points. (You can think of an ordinal scale as a ruler with irregularly spaced, sequentially numbered tick-marks.)

Besides the irregularity of ordinal numbers, there will also be inconsistencies in how survey respondents interpret Likert scales. Consider respondents who really, really, really agree with a statement. The only answer they can choose is 5-Strongly Agree, the same choice that respondents who only really agree will make. Thus, there is no guarantee that respondents selecting the same Likert answer actually share identical opinions or attitudes. Nor is there a guarantee that respondents with identical opinions or attitudes will select the same answer on the 5-point Likert scale. Due to the rubberiness of ordinal numbers and the subjectivity of respondents, it’s a mistake to read too much into the digits in the project’s scores and averages, especially digits to the right of the decimal point.
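If it helps to see that rubberiness in miniature, here is a small hypothetical: give two groups of respondents made-up “true attitude” scores on a 0-to-100 scale, map those onto a 5-point Likert item, and watch the distinctions vanish. The cut points and the scores are invented purely for illustration.

```python
def to_likert(latent_score: float) -> int:
    """Map a hypothetical 0-100 'true attitude' onto a 1-5 Likert answer.
    The cut points are arbitrary, which is the whole point."""
    cuts = [20, 40, 60, 80]
    return 1 + sum(latent_score > c for c in cuts)

# Group A's respondents barely clear the 'agree' and 'strongly agree'
# cut points; Group B's sit near the top of each band. (Made-up numbers.)
group_a = [61, 62, 81, 82]   # latent mean 71.5
group_b = [79, 79, 99, 99]   # latent mean 89.0

likert_a = [to_likert(x) for x in group_a]
likert_b = [to_likert(x) for x in group_b]

print(likert_a, sum(likert_a) / len(likert_a))  # [4, 4, 5, 5] -> 4.5
print(likert_b, sum(likert_b) / len(likert_b))  # same answers, same 4.5 average
```

Two groups with noticeably different underlying attitudes produce identical responses and an identical 4.5 average, which is exactly why small differences (or non-differences) in the project’s averages shouldn’t be read as precise quantities.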

Despite what I just wrote, libraries will be comparing, dissecting, and otherwise poring over Project Outcome scores using the techno dashboard the project provides. So let me finish by echoing an important point made in the webinar: among such slanted data, the most accurate data are the less positive responses. For this reason a below-average library should be thankful to have enough dissatisfied attendees to spur the library to explore ways to improve services and increase patron satisfaction.

Ironically, libraries receiving negative responses are getting more bang for their survey buck than libraries with all positive responses. If stakeholders for a library rated below average say, “But your attendees reported worse outcomes than your peers’ attendees did,” the library administrators should respond, “Our patrons are more discriminating than other libraries’ patrons, an attribute we are eternally grateful for as it helps us improve!” Not that the data show this one way or another, but they certainly don’t disprove it. It is a more responsible interpretation of comparisons with the project’s national averages than presuming that the library is sub-par. For reasons already given, libraries deciding that comparisons with these national averages are irrelevant stand on firm methodological ground.

Well, what else can I say about this project? No question it is headed in a promising direction! While my aim here is raising issues not covered in the webinar, I hope readers will appreciate how professionally the Project Outcome group has approached its task. PLA has a tradition of striving to assess the status of public libraries responsibly, a tradition that includes the illustrious Public Library Inquiry conducted more than 65 years ago.6 Project Outcome team members or participating libraries who have not perused one of the dusty volumes from that study certainly should!

My concern is that the project could have unintended consequences if its findings are seen as too generic, slanted, or naive. The Project Outcome approach definitely has its place, mainly as the perfect starting point for local library self-evaluation. Libraries will prove themselves to be data-savvy by following the project’s advice to look at findings with a critical eye. And by ignoring the advice to portray the findings as unmistakable evidence of immense library benefits. If someone asks why your library would pass up the opportunity to brag about its Project Outcome data, the library should respond, “Oh, that’s simple. It’s because cats generally do not parachute!”

—————————

1   Smith, G. 2014. Standard Deviations: Flawed Assumptions, Tortured Data, and Other Ways to Lie with Statistics. New York: Overlook Duckworth, pp. 36-7.
2   Wainer, H. 2011. Uneducated Guesses: Using Evidence to Uncover Misguided Educational Policies. Princeton, NJ: Princeton University Press, p. 156. Italics in original.
3   Public Library Association. 2015. Project Outcome Survey Results: Maximizing Their Meaning. Retrieved from http://www.ala.org/pla/onlinelearning/webinars/archive/projectoutcomemaximizing (webinar). 15:20. Italics added.
4   Public Library Association. 2015. Project Outcome Survey Results: Maximizing Their Meaning. Retrieved from http://www.ala.org/pla/onlinelearning/webinars/archive/projectoutcomemaximizing (webinar). 16:15.
5   Public Library Association. 2015. Project Outcome Survey Results: Maximizing Their Meaning. Retrieved from http://www.ala.org/pla/onlinelearning/webinars/archive/projectoutcomemaximizing (webinar). 19:10.
6   Leigh, R.D. 1950. The Public Library in the United States: The General Report of the Public Library Inquiry. New York: Columbia University Press; Berelson, B. 1949. The Library’s Public: A Report of the Public Library Inquiry. New York: Columbia University Press; Garceau, O. 1949. The Public Library in the Political Process: A Report of the Public Library Inquiry. New York: Columbia University Press; Bryan, A.I. 1952. The Public Librarian: A Report of the Public Library Inquiry. New York: Columbia University Press. See also: Raber, D. 1997. Librarianship and Legitimacy: The Ideology of the Public Library Inquiry. Westport, CT: Greenwood Press.
