In this post I’ll be telling a tale of averages gone wrong. I tell it not just to describe the circumstances but also as a mini-exercise in quantitative literacy (numeracy), which is as much about critical thinking as it is about numbers. So if you’re game for some quantitative calisthenics, I believe you’ll find this tale invigorating. Also, you’ll see examples of how simple, unadorned statistical graphs are indispensable in data sleuthing!
Let me begin, though, with a complaint. I think we’ve all been trained to trust averages too much. Early in our school years we acquiesced to the idea of an average of test scores being the fairest reflection of our performance. Later in college statistics courses we learned about a host of theories and formulas that depend on the sacrosanct statistical mean/average. All of this has convinced us that averages are a part of the natural order of things.
But the truth is that the idea of averageness is a statistical invention, or more accurately, a sociopolitical convention.1 There is no such thing as an average student, average musician, average automobile, average university, average library, average book, or an average anything. The residents of Lake Wobegon realized this a long time ago!
Occasionally our high comfort level with averages allows them to be conduits for wrong information. Such was the case for the average that went wrong found in this table from a Public Library Funding and Technology Access Study (PLFTAS) report:
Changes in average total operating expenditures [yellow highlighting added].2 Click to see larger image.
The highlighted percentage for 2009-2010 is wrong. It is impossible for public libraries nationwide to have, on average, lost 42% of their funding in a single year. For that average to be true almost all of the libraries would have had to endure cuts close to 40%. Or for any libraries with lower cuts (like 20% or less) there would have been an equivalent number with more severe cuts (70% or greater). Either way a groundswell of protests and thousands of news stories of libraries closing down would have appeared. These did not, of course. Nor did the official Institute of Museum & Library Services (IMLS) data show funding changes anywhere near the -42% in that table. The Public Libraries Survey data show the average expenditures decrease was -1.7%.3
Various factors could have caused the 2010 PLFTAS percentage to be so far off. I suspect that two of these were an over-reliance on statistical averages and the way the averages were calculated.
Since the percentages in the table describe annual changes, they are rates. Rates, you will recall, express how given numbers compare to base figures, like miles per gallon, visits per capita, or number of influenza cases per 1000 adults. The rates in the PLFTAS table indicate how each year’s average library expenditures compare with the prior year. The chart title labels the data average total operating expenditures change.
That label is somewhat ambiguous due to use of the terms average and total together. Usually, a number cannot simultaneously be an average and a total. The percentages in the chart are based on a measure named total operating expenditures, which is the sum of staffing, collection, and other expenditures at an individual library outlet. So, total refers to totals provided by the library outlets, not a total calculated by the researchers from data for the entire group of outlets surveyed.
The title’s wording is ambiguous in another, more significant way. To elaborate let me first abbreviate total operating expenditures as expenditures, making the phrase average expenditures change. Both the chart title and my revised phrase are ambiguous because they can be interpreted in two ways:
1. Average change in expenditures (i.e., average rate of change in expenditures)
2. Change in average expenditures (i.e., rate of change in average expenditures)

Two Interpretations of the Phrase Average Expenditures Change
Tricky, isn’t it? It turns out that percentages from the PLFTAS table fall under the second interpretation, change in average expenditures. That is, the percentages are rates of change in a set of annual averages. The data in the table are the rates while the averages appear elsewhere in the PLFTAS reports.4 (Later on I discuss the alternate interpretation shown in the first row of the table above.)
As explained in my prior post, averages—as well as medians, totals, and proportions—are aggregate measures. Aggregate measures are single numbers that summarize an entire set of data. Thus, we can say more generally that the PLFTAS data are changes in an aggregate measure (an average). Tracking aggregate library measures of one type or another is quite common in library statistics. Here is an example:
Annual library visit totals and annual rate of change for U.S. public libraries.5 Click either chart to see larger image.
The upper chart tracks annual visit totals (aggregate measures) and the lower tracks rates of change in these. The annual rate of change in any measure, including aggregate measures, is calculated as follows:
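The formula itself appeared as an image in the original post; in essence it is the standard percent-change calculation. A minimal sketch in Python, using the IMLS averages cited in footnote 3:

```python
def rate_of_change(current, prior):
    """Annual rate of change: this year's figure compared to last
    year's, expressed as a percentage of last year's figure."""
    return (current - prior) / prior * 100

# Footnote 3: IMLS average expenditures were $1.19M in 2009
# and $1.17M in 2010, a decrease of about 1.7%.
print(round(rate_of_change(1.17, 1.19), 1))  # -1.7
```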
This is exactly how the PLFTAS researchers calculated their—oops…I almost typed average rates! I mean their rates of change in the averages. They compared how much each year’s average expenditure level changed compared to the prior year.
In the earlier table the alternative interpretation of the phrase average expenditures change is average rate of change in expenditures. This type of average is typically called an average rate, which is shorthand for average rate of change in a given measure. An average rate is an average calculated from multiple rates we already have on hand. For example, we could quickly calculate an average rate for the lower of the two line charts above. The average of the 5 percentages there is 3.0%. For the rates in the PLFTAS table the average is -9.6%. In both instances these averages are 5-year average rates.
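As a quick sketch of the arithmetic (the annual percentages here are invented for illustration; the real ones appear in the charts and table above):

```python
# Hypothetical annual rates of change, in percent, one per year
annual_rates = [4.2, 3.1, 2.8, 1.9, 3.0]

# The 5-year average rate is simply the mean of the five annual rates
average_rate = sum(annual_rates) / len(annual_rates)
print(round(average_rate, 1))  # 3.0
```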
But 5-year rates aren’t very useful to us here because they mask the annual details that interested the PLFTAS researchers. We can, however, devise an average rate that incorporates detailed annual expenditure data. We begin by calculating an individual rate for each of the 6000-8000+ library outlets that participated in the PLFTAS studies following the formula on the left side of the table above. We do this for each of the 5 years. Then, for each year we calculate an average of the 6000-8000+ rates. Each of the 5 resulting rates is the average rate of change in total operating expenditures for one year.
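The two-step procedure just described (first an individual rate for each outlet, then a yearly average of those rates) can be sketched with toy data for three outlets; all figures are invented:

```python
# Toy data: 6 years of total operating expenditures per outlet.
# The PLFTAS studies would have 6000-8000+ outlets, not three.
expenditures = {
    "outlet A": [500_000, 510_000, 495_000, 505_000, 500_000, 515_000],
    "outlet B": [1_200_000, 1_150_000, 1_180_000, 1_210_000, 1_190_000, 1_220_000],
    "outlet C": [80_000, 82_000, 81_000, 79_000, 83_000, 84_000],
}

def pct_change(current, prior):
    return (current - prior) / prior * 100

for year in range(1, 6):
    # Step 1: an individual rate of change for each outlet
    rates = [pct_change(series[year], series[year - 1])
             for series in expenditures.values()]
    # Step 2: the average of those rates is that year's average rate
    avg_rate = sum(rates) / len(rates)
    print(f"year {year}: average rate of change = {avg_rate:.1f}%")
```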
Obviously, tracking how thousands of individual cases, on average, change each year is one thing, and tracking how a single aggregate measure like an average or total changes is quite another. The chart below shows how these types of rates differ:
Three Different Types of Library Expenditure Rates. Click to see larger image.
The data are for 9000+ libraries that participated in the IMLS Public Libraries in the U.S. Survey in any of the 5 years covered. Notice that rates for the aggregate measures (red and green lines) decrease faster over time than the average rate (blue line). Since thousands of individual rates were tabulated into the average rate, this rate is less susceptible to fluctuations due to extreme values reported by a small minority of libraries.
On the other hand, rates for totals and averages are susceptible to extreme values reported by a small minority, mainly because the calculation units are dollars instead of rates (percentages).6 This susceptibility would usually involve extreme values due to significant funding changes at very large libraries. (A 1% budget cut at a $50 million library would equal the entire budget at a $500,000 library, and a 10% cut would equal a $5 million one!) Or fluctuations could be caused simply by data for two or three very large libraries being missing in a given year. For the PLFTAS studies, the likelihood of non-response by large systems would probably be higher than in the IMLS data.
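A toy calculation (figures invented) shows how a single large system can dominate a dollar-denominated aggregate while barely moving the average rate:

```python
# 99 small libraries at $500K each, plus one $50M system
year1 = [500_000] * 99 + [50_000_000]
# Only the large system is cut, by 10%; the rest are unchanged
year2 = [500_000] * 99 + [45_000_000]

avg1 = sum(year1) / len(year1)
avg2 = sum(year2) / len(year2)
rate_in_average = (avg2 - avg1) / avg1 * 100
print(round(rate_in_average, 1))  # -5.0: one library moved the aggregate 5%

# The average rate: 99 individual rates of 0% and one rate of -10%
individual_rates = [0.0] * 99 + [-10.0]
average_rate = sum(individual_rates) / len(individual_rates)
print(average_rate)  # -0.1: the same cut barely registers
```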
The other striking thing visible in the line graph above is how trends in rates of change in totals and averages (red and green lines) are nearly identical. So, tracking rates in average funding pretty much amounts to tracking total funding. (Makes sense, since an average is calculated directly from the total.)
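That makes sense algebraically, too: with the same n libraries in both years, the average is the total divided by n, and the n cancels out of the percent-change calculation. A quick check with invented totals:

```python
# Hypothetical nationwide totals for two years, same 9,000 libraries
totals = {2009: 11_000_000_000, 2010: 10_800_000_000}
n = 9_000

def pct_change(current, prior):
    return (current - prior) / prior * 100

rate_of_total = pct_change(totals[2010], totals[2009])
rate_of_average = pct_change(totals[2010] / n, totals[2009] / n)

# The two rates agree (to floating-point precision) whenever
# n is the same in both years.
print(round(rate_of_total, 2), round(rate_of_average, 2))
```

In practice the set of reporting libraries shifts a little from year to year, which is why the red and green lines in the chart are nearly, rather than exactly, identical.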
Now the question becomes, which type of rate is better for understanding library funding changes—rate of change in an average or an average rate? I honestly cannot say for sure. Clearly, each can slant the outcome in certain ways, although that isn’t necessarily a bad thing. It all depends on what features of the data we’re hoping to represent.
Regardless, the lesson is that an unexamined average can be very deceptive. For this reason, it’s always smart to study the distribution (spread) of our data closely. As it happens, staring out of the pages of one PLFTAS report is the perfect data distribution for the -42% mystery discussed here. Beginning with the 2009-2010 edition the PLFTAS studies asked library outlets to report how much funding change they experienced annually. The responses appear in the report grouped into the categories appearing in the left column of this table:
Distribution of changes in 2010 library funding from PLFTAS report.7
Presuming the data to be accurate, they are strong evidence that the -42% average decrease could not be right. The mere fact that funding for 25% of the library outlets was unchanged lowers the chances that the average decrease would be -42%. Add to this the percentages in the greater-than-0% categories (top 5 rows) and any possibility of such a severe decrease is ruled out.
This argument is even more compelling when visualized in traditional statistical graphs (rather than some silly infographic layout). The graphs below show the distributions of data from the table above and corresponding tables in the 2011 and 2012 PLFTAS reports. The first graphic is a set of bar charts, one for each year that PLFTAS8 collected data:
Distributions of budget changes reported in 2010 – 2012 PLFTAS reports9 represented as bar charts.
Click any chart for larger image.
Perhaps you recognize this graph as a trellis chart (introduced in my prior post) since the 3 charts share a single horizontal axis. Notice that on that axis the categories from the PLFTAS table above are now sorted low-to-high, with 0% in the middle. This re-arrangement lets us view the distribution of the data. Because the horizontal axis contains an ordered numeric scale (left-to-right), these bar charts are actually equivalent to histograms, the graphical tools of choice for examining distributions. The area covered by the adjacent bars in a histogram directly reflects the quantity of data falling within the intervals indicated on the horizontal axis.
From the bar charts we see that the distributions for the 3 years are quite similar. Meaning, for one thing, that in 2010 there was no precipitous drop or anything else atypical. We also see that the 0% category contains the most outlets in every year. After that the intervals 0.1 to 2% and 2.1 to 4% account for the most outlets. Even without summing the percentages above the bars we can visually estimate that a majority of outlets fall within the 0% to 4% range. Summing the 2010 percentages for 5 categories 0% or higher we find that 69% of the outlets fall within this range. For 2011 the sum is also 69% and for 2012 it is 73%.
Visually comparing the distributions is easier with the next set of graphs, a line chart and a 3-D area chart. I usually avoid 3-D graphics completely since they distort things so much. (In the horizontal axis, can your eyes follow the 0% gridline beneath the colored slices to the back plane of the display?) Here I reluctantly use a 3-D chart because it does give a nice view of the distributions’ outlines, better than the line chart or separate bar charts. So, I hereby rescind my policy of never using 3-D graphics! But I stick by this guiding principle: Does the graphical technique help us understand the data better?
Distributions of budget changes reported in 2010 – 2012 PLFTAS reports10 as line chart and 3-D area chart. Click either chart for larger image.
Notice that the horizontal axes in these charts are identical to the horizontal axis in the bar charts. Essentially, the line chart overlays the distributions from the bar charts, confirming how similar these three are. This chart is also useful for comparing specific values within a budget change category or across categories. On the other hand, the closeness of the lines and the numerous data labels interfere with viewing the shapes of the distributions.
Here’s where the 3-D chart comes in. By depicting the distributions as slices the 3-D chart gives a clear perspective on their shapes. It dramatizes (perhaps too much?) the sharp slopes on the negative side of 0% and more gradual slopes on the positive side. Gauging the sizes of the humps extending from 0% to 6% it appears that the bulk of library outlets had funding increases each year.
So, there you have it. Despite reports to the contrary, the evidence available indicates that no drastic drop in public library funding occurred in 2010. Nor did a miraculous funding recovery restore the average to -4% in 2011. (Roughly, this miracle would have amounted to a 60% increase.) Accuracy-wise, I suppose it’s some consolation that in the end these two alleged events did average out!
1 Desrosieres, A. 1998. The Politics of Large Numbers: A History of Statistical Reasoning. Cambridge MA: Harvard University Press. See chapters 2 & 3.
2 Hoffman, J. et al. 2012. Libraries Connect Communities: Public Library Funding & Technology Study 2011-2012, p. 11, [yellow highlighting added].
3 Based on IMLS data the 2009 average expenditures were $1.19 million and the 2010 average was $1.17 million, a 1.7% decrease. Note that I calculated these averages directly from the data. Beginning in 2010, IMLS changed the data appearing in its published tables to exclude libraries outside the 50 states and entities not meeting its library definition, so it was not possible to obtain comparable totals for 2009 and 2010 from those tables.
4 I corresponded with Judy Hoffman, primary author of the study, who explained the calculation methods to me. The figures necessary for arriving at the annual averages appear in the detailed PLFTAS reports available here.
5 Lyons, R. 2013. Rainy Day Statistics: U.S. Public Libraries and the Great Recession. Public Library Quarterly. 32:2, pp. 106-107.
6 This is something akin to political voting. With the average rate each library outlet submits its vote—the outlet’s individual rate of expenditure change. The range of these will be relatively limited, theoretically from -100% to 100%. In practice, however, very few libraries will experience funding increases higher than 40% or decreases more severe than -40%. Even if a few extreme rates occur, these will be counter-balanced by thousands of rates less than 10%. Therefore, a small minority of libraries with extreme rates (high or low) cannot sway the final results very much. With the calculation of annual averages, each library votes with its expenditure dollars. These “votes” have a much wider range—from about $10 thousand to $100 million or more. With aggregate measures like totals, means/averages, and medians, each library’s vote is essentially weighted in proportion to its funding dollars. Due to the quantities involved, aggregate library measures are affected much more by changes at a few very large libraries than by changes at a host of small libraries.
7 Adapted from Bertot et al. 2010. 2009-2010 Public Library Funding & Technology Survey: Survey Findings and Results, p. 61.
8 From annual Public Library Funding and Technology Access Survey: Survey and Findings reports. See table 66, p. 61 in the 2009-2010 report; table 53, p. 54 in the 2010-2011 report; and table 57, p. 65 in the 2011-2012 report.