It’s great to see other librarians advocating for the same causes I harp on in this blog. I’m referring to Sarah Robbins, Debra Engel, and Christina Kulp of the University of Oklahoma, whose article appears in the current issue of College & Research Libraries. The article, entitled “How Unique Are Our Users?,”1 warns against the folly of using convenience samples. It implores library researchers to honestly explain the limitations of their studies. And the authors are resolute about the importance of understanding the generalizability of survey findings, a topic which also happens to be the main focus of their study.
I bring up their article for a different reason, however. It is an example of how difficult and nuanced certain aspects of research and statistics can be. Despite the best of intentions, it’s amazingly easy to get tripped up by one or another detail. Robbins and her colleagues got caught in the briar patch that is statistics and research methods. I say so because the main conclusions reached in their study are not actually borne out by their survey results.
I’ll give the short explanation now and then follow with the whole story. Basically, the researchers argue that when certain user behaviors or perceptions are thought to be essentially the same regardless of the users’ home institutions, survey findings from one institution can be applied (“generalized”) to other institutions. Then they present findings from their own survey demonstrating that certain user behaviors and perceptions are, indeed, the same across institutions. However, the statistical generalizability2 of survey data depends solely on how the data were collected, that is, on the sampling method(s) used. Only when one’s own institution has been included in the initial sample selection process can survey results be justifiably generalized to one’s own institution.
In addition, the researchers’ method for confirming user behaviors and perceptions to be essentially the same across institutions relied on a mistaken understanding of the statistical procedure they used. They assumed the procedure, statistical significance testing, proved something that it cannot actually prove.
What Survey Data Can Tell Us
Now for the longer story. In their study Robbins and her colleagues sought to answer this research question: “To what extent can survey data from other institutions be generalized to one’s own institution?” By this I assume they mean, “Should survey data from other institutions be viewed as trustworthy estimates of actual attributes or states of affairs at one’s own institution?”
Let me suggest an example that might help answer this question: Suppose we recruit two survey research teams to study opinions about the legendary Steve Jobs. We assign one team to conduct telephone interviews of individuals from a randomly selected sample of 100 adult residents of our community. The other team is assigned to conduct 100 woman-on-the-street interviews in various locations in the community (that is, we employ a convenience sample). Tallying the survey results, we learn that all respondents in both surveys reported having high admiration for Steve Jobs. This leads us to announce, “There is no reason to conduct additional surveys on this topic. It is clear that there is unanimity about Jobs. In fact, the responses were so consistently positive that we are confident that they are applicable to other U.S. communities also.”
But what evidence is there to confirm that opinions about Jobs are unanimous nationwide? At first, we might think that Steve Jobs’ reputation makes this unanimity credible. However, our beliefs about Jobs’ reputation are not objective evidence. Is the unanimity observed in our data evidence that the surveys provide a balanced account of opinions nationwide? No, it is not. For one thing, we obtained unanimous responses from the convenience sample. Yet we know that this sampling method is usually grossly inadequate due to built-in biases.
More importantly, to obtain a valid measurement of a phenomenon our measurement must span the entire scope of that phenomenon. If we wanted to find out how much a man weighs, we wouldn’t just weigh his big toe, unless we have a really tried-and-true formula for calculating body weight from big toe weight. Neither would we ascertain his weight by weighing a stranger standing nearby. Nor would we assume that the man’s weight will be equal to the average of the last ten people we weighed.
The same goes for survey sampling. Our sample must be selected from the specific population we’re interested in, in this case the U.S. population. And the sample must span the breadth of that population. Only by doing so are we justified in making the leap (generalizing or inferring) from the sample to the population.
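A small simulation can make this concrete. The sketch below, with entirely made-up numbers, imagines a population in which admiration for a public figure runs at 60% overall but much higher among one subgroup. A random sample estimates the overall rate reasonably well; a convenience sample that over-represents the subgroup (say, interviews conducted outside a tech campus) does not, no matter how consistent its responses look.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population of 100,000 adults: admiration runs at 90%
# among tech workers (20% of the population) and 52.5% among everyone
# else, for an overall rate of about 60%. These numbers are invented.
population = (
    [("tech", random.random() < 0.90) for _ in range(20_000)]
    + [("other", random.random() < 0.525) for _ in range(80_000)]
)

true_rate = sum(admires for _, admires in population) / len(population)

# Random sample: every member of the population has an equal chance
# of selection, so the estimate is unbiased.
random_sample = random.sample(population, 100)
random_est = sum(admires for _, admires in random_sample) / 100

# Convenience sample: interviews near a tech campus, so tech workers
# make up 70 of the 100 respondents instead of roughly 20.
tech = [p for p in population if p[0] == "tech"]
other = [p for p in population if p[0] == "other"]
convenience_sample = random.sample(tech, 70) + random.sample(other, 30)
convenience_est = sum(admires for _, admires in convenience_sample) / 100

print(f"true rate:              {true_rate:.2f}")
print(f"random-sample estimate: {random_est:.2f}")
print(f"convenience estimate:   {convenience_est:.2f}")
```

The convenience estimate lands near 0.79 rather than 0.60, and nothing inside the sample itself warns us of the bias. That is precisely why the sampling method, not the content of the responses, determines what the data can be generalized to.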
Nothing in the content of survey data—including uniformity or diversity of responses—justifies our generalizing from the data to one or another population. To put this another way, data are not psychic. They cannot divine or foretell information about some other realm or context beyond their own. Examining patterns in a set of data we have on hand tells us very little about patterns that will appear in another setting. Sure, trends in the data might hint at possible scenarios in other settings. But these trends cannot confirm the scenarios to be factual.
The Bane That Is Statistical Significance
Yet Robbins et al. argued that the content of their survey data was an indicator both of that survey’s statistical generalizability and of the statistical generalizability of other surveys. They based their argument on something called a chi-square test. This is a statistical tool that identifies and confirms statistical trends in categorical data. A key component of this tool is statistical significance testing, a topic I also discussed in a recent blog entry.
I beg the reader’s pardon for the tedious material I am about to present. However, we have to cover the generally indigestible details of statistical significance testing in order to understand how the researchers got caught in its snare. So here goes. Statistical significance testing is basically a pass-fail quality test for survey data. The testing confirms, to a reasonable degree, that trends observed in data are not artifacts caused by sampling side-effects. These artifacts are often described in phrases that attribute trends in the data to “the luck of the draw” or to “chance alone.” Other phrases describe the trends as “statistical noise” or having occurred “by accident.”
Statistical significance testing begins with an assumption established for the sake of argument and known as the null hypothesis. The word null is used because this assumption describes a situation of no difference. An example null hypothesis is:
There are no differences in the average number of electronic journals accessed by faculty from one institution compared with averages for faculty at other institutions.
This hypothesis supposes (remember, this is all for the sake of argument) that the journal access averages will be essentially equal among all of the institutions studied.
The alternative hypothesis takes the contrary stance:
There are differences in the average number of electronic journals accessed by faculty from one institution compared with averages for faculty at other institutions.
This hypothesis supposes that journal access averages among the institutions studied will be unequal (that is, they will vary).
Typically, researchers want to be able to dismiss (technically, to decide to reject) the null hypothesis in favor of the alternative hypothesis. When data for a particular survey questionnaire item pass statistical significance testing, they are declared statistically significant, meaning that differences observed between two or more groups (such as faculty from institution A compared to institution B) for the item are likely to reflect real differences rather than statistical noise.
Basically, this outcome is a bet based on calculated probabilities that the null hypothesis is untrue. The outcome is considered to be evidence supporting, but not directly proving, the alternative hypothesis—in this example, the claim that there really are differences in electronic journal access averages depending on institution.
On the other hand, when data for a particular survey item fail significance testing, they are declared statistically insignificant, meaning that the differences observed are most likely statistical noise. In this case, the null hypothesis has not been dismissed (rejected). Thus, this outcome provides no evidence supporting the alternative hypothesis.
However—and this is a big however—a decision not to reject the null hypothesis is not the same as proving the null hypothesis to be true. In statistical significance testing we assume the null hypothesis to be true only for argument’s sake. This whole approach is completely silent (or you might say agnostic) about the actual trueness or falseness of the null hypothesis. This is why statisticians are careful to say, “We fail to reject the null hypothesis” instead of, “We accept the null hypothesis to be true.”3
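For readers who like to see the mechanics, here is a minimal Python sketch of a chi-square test on a 2×2 table. The counts are invented for illustration (faculty at two hypothetical institutions, classified by whether they access e-journals weekly); the 3.841 cutoff is the standard critical value for one degree of freedom at the conventional 0.05 significance level.

```python
# Hypothetical 2x2 contingency table: faculty at two institutions,
# counts of "accesses e-journals weekly" (yes) vs "does not" (no).
observed = [
    [45, 55],   # Institution A: yes, no
    [60, 40],   # Institution B: yes, no
]

# Expected counts under the null hypothesis of no association
# between institution and access behavior.
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Pearson chi-square statistic: sum of (O - E)^2 / E over all cells.
chi_sq = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)

# Critical value for 1 degree of freedom at the 0.05 level.
CRITICAL_VALUE = 3.841
decision = "reject the null" if chi_sq > CRITICAL_VALUE else "fail to reject the null"
print(f"chi-square = {chi_sq:.3f} -> {decision}")  # about 4.511 -> reject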
So that’s pretty much the story on statistical significance, tangled as it is. Now let’s see how statistical significance testing fit into the study by Robbins and her colleagues.
Getting Tripped Up
The researchers examined responses of engineering faculty to eight questionnaire items, noting how these varied among the sixteen institutions surveyed. They looked at which items passed statistical significance testing and which ones failed. They considered any items passing significance testing to be poor candidates for their purpose, which was to enable librarians to describe their own institution using another institution’s survey results. Since institutional differences on these items were real rather than statistical noise, the researchers concluded that for these items librarians could not rely on other institutions’ survey results.
Conversely, the researchers viewed items failing significance testing as good candidates for their purpose (described above). Because differences observed between institutions on these items were very likely due to statistical noise, the researchers concluded that the sixteen surveyed institutions were essentially the same on the items, as other non-surveyed institutions would be also. In other words, they believed that the outcome of statistical significance testing proved the null hypothesis of no difference to be true.
This is the snare Robbins and her colleagues got caught in. As I explained already, statistical significance testing does not evaluate the actual trueness of the null hypothesis. As statistician Howard Wainer describes this:
…even if two individuals are not ‘statistically different’ it does not mean that the best estimate of their difference is zero.4
For this reason the researchers’ chi-square results are not a sound basis for concluding that all academic institutions are essentially the same on these items.
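A hypothetical example shows how easily this fallacy arises. Suppose the true weekly-access rates at two institutions really do differ—60% at one, 40% at the other—but we survey only 20 faculty at each. Even an observed table that mirrors those true rates exactly fails significance testing, because the samples are too small to distinguish a real difference from noise. (The chi-square computation and the 3.841 cutoff are the same as in the earlier sketch; all numbers are invented.)

```python
# True rates really differ (60% vs 40%), but samples are tiny (n=20 each).
observed = [
    [12, 8],    # Institution A: 60% yes
    [8, 12],    # Institution B: 40% yes
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi_sq = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)

CRITICAL_VALUE = 3.841  # 1 degree of freedom, 0.05 level
decision = "reject the null" if chi_sq > CRITICAL_VALUE else "fail to reject the null"
print(f"chi-square = {chi_sq:.3f} -> {decision}")  # 1.600 -> fail to reject
```

Here the test fails to reject the null hypothesis even though the null is, by construction, false. Reading that outcome as proof that the two institutions are the same would be exactly the mistake described above.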
Who Needs Statistical Significance Anyway?
In their defense, I don’t think the researchers took the logic of statistical significance completely to heart. They ignored it when it seemed unreasonable, as these statements indicate:
…while the chi-square analysis indicates that, regardless of the institution surveyed, researchers could expect to receive similar results [if they were to survey faculty at other institutions], the results themselves suggest otherwise.5
While the association between institution and responses to this question [about departmental duties of faculty] were statistically insignificant for the most part, librarians at the institution are best poised to know the job requirements of their institutions’ faculty…6
The importance that faculty place on assistance from library personnel was not shown to be statistically significantly associated to institution (p=0.123). This suggests that, regardless of the institution surveyed, researchers could expect to receive similar results [if they were to survey faculty at other institutions]. However, a look at figure 2 tells a different story.7
And they freely analyzed between-institution variation of some of the items without regard for statistical significance. The graph below is an example:
Source: Robbins et al., 2011, ‘How Unique Are Our Users?’ College & Research Libraries.
Pondering these visible differences may well have led Robbins and her colleagues to some questions not addressed in the article: If a single institution is going to adopt survey data from other institutions as their own, what specific data values should they adopt? Should average responses from the sixteen institutions in their study be used as the best estimates? Or perhaps averages gleaned from past or even future studies? Or should an institution choose one particular surveyed institution to extrapolate from? If so, on what criteria should they base their choice?
Returning to the more general theme of the study, I should mention a second question the researchers posed:
…is it reasonable to believe that faculty members or students at one institution are all that different from faculty and students at another institution?8
Some user perceptions or behaviors may well be uniform, for example, the belief by faculty that the electrical outlets in their offices ought to work, or the belief by students that tuition hikes are undesirable.
However, for the majority of issues pertinent to library assessment there will always be variation, within institutions and between them. To understand and appreciate this variation, even for purposes of ultimately ignoring it, we have to measure it. And this measurement has to occur in the actual settings and situations we want to understand.
1 Robbins, S., Engel, D. and Kulp, C. (2011). How unique are our users? Comparing responses regarding the information-seeking habits of engineering faculty, College & Research Libraries, 72(6), 515-532.
2 Statistical generalizability is also referred to as external validity. Though this second term came to be used in connection with survey research studies, it was first applied to experimental and quasi-experimental studies. With these studies, external validity is contrasted with internal validity, which refers to how well experimental conditions were controlled so as to make the results trustworthy.
Generalizability (or generalization) is a generic term that includes statistical generalizability. See Polit, D.F. and Beck, C.T. (2010). Generalization in quantitative and qualitative research: Myths and strategies, International Journal of Nursing Studies, 47, 1451–1458.
3 See Luk Arbuckle’s blog entry for a thorough discussion of the problem of proving the null hypothesis.
4 Wainer, H. (2009). Picturing the uncertain world: How to understand, communicate, and control uncertainty through graphical display, Princeton, NJ: Princeton University Press, p. 76.
5 Robbins, S. et al. (2011). p. 526. (Blue emphasis added.)
6 Robbins, S. et al. (2011). p. 521. (Blue emphasis added.)
7 Robbins, S. et al. (2011). p. 526. (Blue emphasis added.)
8 Robbins, S. et al. (2011). p. 519.