I borrowed the title for this entry from a 2009 study of student research practices by Randall McClure and Kellian Clink. Their study is cited in an article in the current issue of *College & Research Libraries* that Joe Matthews brought to my attention. This article is "Students Use More Books After Library Instruction" by Rachel Cooke and Danielle Rosenthal. Both articles explore research sources and citations that undergraduate students use in writing assignments. Though it's this second article I want to discuss, McClure and Clink's well-chosen title is too good to pass up. In fact, I'm thinking of making it the motto of this blog!

Anyway, in their article Cooke and Rosenthal report that university English composition students "used more books, more types of sources, and more overall sources when a librarian provided instruction."1 Their statement contains two separate claims: (1) the quantity of citations used by composition students who received library instruction differs from the quantity used by students who did not; and (2) this difference is due to the library instruction.

Let’s look at the first of these and save the second for some other time.2 I suggest we start by applying a basic principle of information literacy, which I paraphrase here:

Don’t just take the information in front of you at face value. Try to identify possible shortcomings that might compromise the accuracy, authenticity, or relevance of the information. The fewer shortcomings the information has, the more trustworthy it is.

Put another way, you have to be able to ask, *how do you know that* the information is true?

Statisticians also happen to be preoccupied with the relative trueness of information. And one thing they obsess over is inaccuracy that comes from measuring a subset (sample) of a larger group as a way to learn about that group. In the Cooke and Rosenthal study this larger group is all freshman English composition students enrolled in their university in the study year (or perhaps in the last couple of years). Their sample consists of those students whose papers the researchers examined.

Samples are only estimates of what is true for the larger population. The data always contain a certain amount of inaccuracy, making the sample an imperfect representation of the population.3 If you have ever played cards, you've experienced this same situation. You've seen how easy it is to be dealt a 7-card Rummy hand that contains only low cards, or only two suits, or a majority of face cards. You know hands like these aren't very good representations of the contents of the whole deck of cards because you know the contents of that deck.

This misrepresentation happens in the same way with samples, except we are mostly ignorant about the contents (those characteristics we want to know about) of the whole population that we are studying. So, we can’t tell how far off our sample values are.
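The card analogy can be made concrete with a short simulation. The scoring here is invented for illustration (each card counts at face value, ace = 1 through king = 13), but the point is the author's: because we built the "population" ourselves, we can watch individual hands stray from its true average purely by the luck of the draw.

```python
import random
import statistics

# A 52-card deck, scoring each card at face value (ace = 1 ... king = 13).
deck = [value for value in range(1, 14) for _suit in range(4)]
true_mean = sum(deck) / len(deck)  # 7.0 -- we know the whole "population"

# Deal many 7-card Rummy hands and record each hand's average card value.
random.seed(42)
hand_means = [statistics.mean(random.sample(deck, 7)) for _ in range(1000)]

# Individual hands scatter around the true mean of 7.0 by chance alone.
print(true_mean, min(hand_means), max(hand_means))
```

With a real population we would only ever see one "hand" — a single sample — and, unlike with the deck, we would have no way to check how far its average strays from the truth.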

Keeping this in mind, let’s look at these data that Cooke and Rosenthal present:

Figure 1. Source: Cooke and Rosenthal, 2011, p. 335.

Just as a Rummy hand might have one suit over-represented, the sample of non-instructed students could contain a high proportion of papers having only one citation—higher than the true proportion in the larger population. If this is the case, then the 3.2 average is too low. The average for the population might really be 4.0 or so. And the opposite could be true for the instruction group. What if, by chance, two over-zealous students in the sample each put 20 citations in their papers? In a small sample of about 60 students, these two high values would inflate the sample average. So, maybe the real figure for all instructed students is more like 4.5.
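Here is that outlier scenario in miniature. All of the citation counts below are invented; the arithmetic just shows how two 20-citation papers can drag up the average of a sample of about 60 students.

```python
# Hypothetical citation counts for 60 papers: most students cite 3-5 sources.
typical = [4, 5, 4, 3, 5] * 12            # 60 papers, averaging 4.2 citations
base_mean = sum(typical) / len(typical)   # 4.2

# Swap in two over-zealous students with 20 citations each.
with_outliers = typical[:58] + [20, 20]
inflated_mean = sum(with_outliers) / len(with_outliers)

print(base_mean, round(inflated_mean, 2))  # 4.2 vs. 4.73
```

Two papers out of sixty — about 3% of the sample — move the average by half a citation. In samples this small, a handful of chance draws carries real weight.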

For these reasons we can't judge sample data by *appearances* alone. While the 5.3 average *appears* much higher than 3.2, something like the scenarios I just described could be closer to the truth. Mind you, this is not to suggest that my alternate scenarios *actually are* true. This is merely to ask, how do you know that they are not?

Enter the statisticians. They are experts at looking for ways to make the murky waters of data clearer (kind of), in this case by taking sampling uncertainty into account. They realize that if scenarios like the ones I described can be shown to be very unlikely, this justifies our having more confidence in the sample results. They approach this challenge by putting the sample data through an extra hurdle, which is essentially an additional quality test. This extra hurdle is known as statistical significance testing. If the data pass this testing, statisticians tell us we can be very (95% or more4) confident that the findings are not due to the vagaries of sampling. In other words, we can be reasonably sure that the numeric differences we observe are authentic, rather than being "due to the-luck-of-the-draw" or "attributable to chance."
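One way to run such a test without any special software is a permutation test, sketched below. The individual citation counts are invented — the article reports only the group averages, 3.2 and 5.3, and the real groups were larger — so treat this as mechanics, not as the authors' data: if instruction made no difference, the group labels are arbitrary, so we shuffle the labels many times and ask how often chance alone produces a gap as large as the observed one.

```python
import random
import statistics

# Invented citation counts for two groups of 10 papers each,
# constructed so the means match the reported 3.2 and 5.3.
no_instruction = [1, 2, 3, 3, 4, 2, 5, 3, 4, 5]   # mean 3.2
instruction    = [4, 6, 5, 7, 3, 6, 5, 8, 4, 5]   # mean 5.3
observed_gap = statistics.mean(instruction) - statistics.mean(no_instruction)

# Shuffle the pooled counts many times; count how often a random
# relabeling produces a gap at least as large as the observed 2.1.
random.seed(0)
pooled = no_instruction + instruction
extreme = 0
trials = 10_000
for _ in range(trials):
    random.shuffle(pooled)
    gap = statistics.mean(pooled[10:]) - statistics.mean(pooled[:10])
    if gap >= observed_gap:
        extreme += 1

p_value = extreme / trials
print(f"observed gap: {observed_gap:.1f}, p = {p_value:.4f}")
```

If shuffled labels almost never reproduce a gap of 2.1, the luck-of-the-draw explanation looks implausible, and the data pass the extra hurdle; if random relabelings produce such gaps routinely, the observed difference could easily be a statistical accident.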

Statistical significance testing is simply a tool that allows us to address one pesky data problem (among several), the possibility that our data primarily reflect the-luck-of-the-draw. You can read about it in introductory statistics textbooks. But beware. This is a troublesome topic that can be really confusing. As explained in the video5 below, a high probability that the numbers are not attributable to chance does *not* mean we can apply the same high probability to our belief that the data are accurate and reliable. This is the odd thing about statistical significance testing. It indicates what our data *probably are not,* not what they *probably are*.

As you can tell, this topic entails a lot of *probablys*, *maybes*, *likelys*, and *possiblys,* as well as twisted logic! But we can *definitely* say that, while statistical significance testing is not the be-all and end-all of data quality assurance, it does help alleviate some of the problems connected with sampling. If Cooke and Rosenthal, and also the other researchers cited in their article, were to put their data through this extra quality test, that would be one less worry for information-literate readers. Otherwise, there will always be the nagging doubt that the differences and contrasts they report are just statistical accidents.

—————————

1 Cooke, R. and Rosenthal, D. (2011). Students use more books after library instruction: An analysis of undergraduate paper citations, *College & Research Libraries, 72*(4), p. 332.

2 The second claim is more complicated than you might think. It requires us to consider what is sufficient proof that a library program or service is the sole (or even major) cause of an observed outcome. Plus, we’d need to delve into the objectives and desired learning outcomes of the instructional program. As Joe Matthews pointed out to me, counts of citations are probably not valid indicators of the quality of the student papers. McClure and Clink raise this same question. See: McClure, R. and Clink, K. (2009). How do you know that? An investigation of student research practices in the digital age, *portal: Libraries and the Academy, 9*(1), p. 120.

3 Statisticians call this sampling error. There is also nonsampling error, which we eventually have to consider.

4 If we choose the 95% significance level as our testing criterion, and the data pass this test, this means that results at least as extreme as ours would show up as statistical flukes less than 5% of the time, or 1 in 20, if sampling luck were the only thing at work. At the 99% significance level, that fluke rate drops to 1%, or 1 in 100.

5 I link to this video because the creator, Jose Silva, makes two excellent points: Our data are estimates; and statistical significance is "about your results coming up by accident." Ironically, part of Silva's explanation is wrong. His statement "Given our estimate of 320, the probability that the real effect size is zero is less than 5%" is backwards. 5% is the probability that our sample values (or any values more extreme than these) would occur, given a basic assumption we start with for the sake of argument. This assumption always involves *zero*, one way or another. For Cooke and Rosenthal's data in the table above this would take the form, "Let's assume the averages of the instructed and non-instructed groups are equal, meaning their difference is zero."

This assumption is the groundwork for the testing, not the result of it. So the 5% applies to the data, not to the assumption. It indicates how probable our sample values would be if, in the real world, the zero-related assumption were true. For more information see Kline, R., 2004, *Beyond Significance Testing: Reforming Data Analysis Methods in Behavioral Research*, Washington, DC: American Psychological Association.
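This conditional logic — the 5% describes the data under the zero-difference assumption, not the assumption itself — can be demonstrated by simulation. All numbers below are hypothetical: both "groups" are drawn from one and the same population, so the real difference truly is zero, and we count how often sampling luck alone still produces a sizable gap.

```python
import random
import statistics

# Both groups come from the SAME hypothetical population (mean 4 citations,
# spread 1.5), so any gap between their sample averages is pure chance.
random.seed(1)
trials = 20_000
big_gaps = 0
for _ in range(trials):
    a = [random.gauss(4.0, 1.5) for _ in range(10)]
    b = [random.gauss(4.0, 1.5) for _ in range(10)]
    # 1.32 is roughly the 5% cutoff for samples of this size and spread.
    if abs(statistics.mean(a) - statistics.mean(b)) >= 1.32:
        big_gaps += 1

fluke_rate = big_gaps / trials
print(f"gaps this large from chance alone: {fluke_rate:.1%}")  # about 5%
```

The rate comes out around 1 in 20 — a statement about how samples behave *when the zero assumption holds*, which is exactly why passing the test tells us what our data probably are not, rather than what they probably are.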