Honest-to-Goodness Transformation

A while back, in his 21st Century Library Blog, Steve Matthews commented on some data appearing in a report entitled The Library in the City, published by the PEW Charitable Trusts Philadelphia Research Initiative. Dr. Matthews was puzzled by an inconsistency between statistical trends highlighted in the report and standard per capita circulation, visits, and Internet computer measures. He noted, for example, that among the libraries studied, Columbus Metropolitan Library had the greatest cumulative decline in visits (-17%) over the seven-year study period. Yet in 2011 Columbus ranked 2nd in the group on visits per capita. The opposite was true for the Enoch Pratt Library in Baltimore. Although the library showed the second highest cumulative increase in visits (at 25%), its 2011 per capita visit rate was the lowest in the group. Curious patterns, indeed.

There are a couple of statistical dynamics at play here. Matthews identified the first one when he stated that “the parameters of the data collection…determine the outcome [of an analysis].” This is particularly true for benchmark comparisons whatever the topic of the comparisons (vacation resorts, automobiles, wines, most hated corporations, and so on). The results always depend on who’s included and excluded in the comparisons.

The PEW Research Initiative’s comparison group was a purposive sample of 15 large urban public libraries in the U.S. If we expand that to U.S. public libraries reporting $25 million or higher total expenditures in 2009 IMLS data (N = 64), then Columbus’ visit rate ranks 4th. Looking at libraries serving populations of 500,000 or more (N = 80), Columbus moves to 5th place. Granted, the IMLS data are older than the data in the PEW report. But the same principle applies. So, predicting place rankings from raw statistics can be a bit iffy.

But there’s a more fundamental reason for the puzzle Matthews described. Per capita data are rates, and rates are peculiar mathematical things. In fields like epidemiology, rates (such as the prevalence of cancer among adult males over a particular period) provide a useful framework for tracking and making comparisons among regions, countries, and time periods. Expressing data as rates puts each entity on a par by adjusting for the size of the relevant population base. Still, you have to be careful when interpreting rates. For instance, rate comparisons among disparate regions, like very small versus large cities, can lead to wrong conclusions. Adding one library computer or case of swine flu in a small town can easily shoot that town’s per capita rates up. In a large city, a brand new swine flu case or library computer will hardly be a blip on the statistical charts.

This example suggests one reason for the disconnect Matthews observed between visit and circulation trends and current per capita statistics. It takes a lot to get a large urban library per capita rate to change since the rate’s divisor/denominator is millions of capitas! Incidentally, this is why the chart from the PEW report shown below, and the lot of similar charts circulated in librarydom in recent years, are such gross exaggerations:

PEW Philadelphia Graph

Source: The Library in the City, 2012, PEW Trusts Philadelphia Research Initiative.

Because the initial baselines (reported count of computers or sessions from which cumulative growth is calculated) are much smaller than baseline figures for visits and circulation, charts like this one are misleading. The chart implies that demand for computers outstrips demand for other services, something these data cannot substantiate. Which would you worry about more—our gaseous sun expanding 12% over the next seven years or the moon growing by 80%? Well, maybe this isn’t a good analogy due to gravity, solar flares, the closeness of the moon to mother earth, and all of that. But you get the idea. A 10% increase from a baseline of 1,000,000 library visits is a much bigger deal than a 100% increase from a baseline of 100 installed computers or from 10,000 computer sessions. Visits and circulation counts are still the elephants in the libraries, so to speak.
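To put numbers on that comparison, here is a minimal Python sketch that converts each percentage back into a raw count. The baselines and percentages are the ones quoted in the paragraph above; the function name `absolute_change` is hypothetical, chosen only for illustration.

```python
def absolute_change(baseline, pct_growth):
    """Convert a percentage growth figure back into a raw count."""
    return baseline * pct_growth / 100


# Baselines and growth percentages taken from the paragraph above.
examples = [
    ("library visits", 1_000_000, 10),
    ("installed computers", 100, 100),
    ("computer sessions", 10_000, 100),
]

for label, baseline, pct in examples:
    change = absolute_change(baseline, pct)
    print(f"{label}: +{pct}% of {baseline:,} = {change:,.0f} more")
```

The 10% rise in visits amounts to 100,000 additional visits, while the 100% rises amount to 100 computers and 10,000 sessions. Plotting only the percentages hides that difference in scale.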

Growth rates for any start-up project are high early on and flatten out over time, as the lower line in the graph below illustrates. In the upper line, notice how the initial years boost the cumulative rate. Half (67%) of the 10-year cumulative growth occurred in the first two years. The other half took eight additional years. Again, this is due to low baseline figures in early years. So, a fairer portrayal of growth in computer sessions would be rates in later years when the trend had leveled out.

In a start-up project annual rates begin high and level out over time. Cumulative growth rate always entails early-year boosts.
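The difference between annual and cumulative growth can be sketched in a few lines of Python. The yearly session counts below are invented for illustration and are not the data behind the graph; `annual_growth` and `cumulative_growth` are hypothetical helper names.

```python
def annual_growth(counts):
    """Percent change from each year to the next."""
    return [(curr - prev) / prev * 100 for prev, curr in zip(counts, counts[1:])]


def cumulative_growth(counts):
    """Percent change of each later year relative to the first year."""
    base = counts[0]
    return [(c - base) / base * 100 for c in counts[1:]]


# Hypothetical yearly session counts for a start-up service (not PEW data).
sessions = [100, 150, 180, 198, 208]

for year, (annual, cumulative) in enumerate(
    zip(annual_growth(sessions), cumulative_growth(sessions)), start=1
):
    print(f"year {year}: annual {annual:5.1f}%  cumulative {cumulative:5.1f}%")
```

With these made-up counts the annual rate shrinks every year, yet the cumulative figure keeps climbing because it is always measured against the small first-year baseline.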

Returning to per capita rates: earning, say, a 2/10ths of a point increase in a given rate means a library has to increase the underlying statistical count by 0.2 times the library’s service area population. In very large cities this can be a mammoth task.
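As a rough sketch of that arithmetic in Python: the two service area populations below are hypothetical, picked only to contrast a small town with a large city, and `required_count_increase` is an illustrative name rather than any standard measure.

```python
def required_count_increase(rate_increase, population):
    """Extra visits (or loans, sessions, etc.) needed to lift a per capita rate."""
    return rate_increase * population


# Hypothetical service area populations: a small town and a large city.
for population in (20_000, 1_500_000):
    extra = required_count_increase(0.2, population)
    print(f"population {population:>9,}: {extra:,.0f} more visits for +0.2 per capita")
```

Under these assumptions the same 0.2-point gain costs the small town a few thousand extra visits and the large city several hundred thousand.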

So what are rates, anyway? Maybe you’ve noticed nowadays how library marketeers love using the romantic terms transformation and transformative. Well, in per capita rates we have an honest-to-goodness example of such a thing. Per capita calculations are transformed measures. That is, they are quantities re-expressed relative to (divided by) a standard unit, like hours, dollars, people, and so on. The result of this re-expression (and division) is that the data lose their original units of measure, which we could call their dimensionality. Due to this transformation, miles per gallon tells something about fuel efficiency but nothing about distance traveled. And books per capita tells us… Uh, what in the heck does it tell us, anyhow? Whatever that is, a books per capita statistic gives no clues about the actual size of a library collection.

On one hand, per capita transformations make the data more useful by enabling comparisons. On the other hand, these transformations “sap” some of the information from the data, namely, their original magnitudes (sizes). The reduced data end up being statistically unrelated to non-rates data (but the rates remain statistically related to other rates). So, the disconnect that Steve Matthews saw is primarily due to the basic nature of per capita measures, specifically, to their changed dimensionality.

The charts below illustrate this. For this first set of charts I augmented the 15-library sample from the PEW report by arbitrarily adding nine more large U.S. urban libraries.1

The trendline in chart 1A shows the fairly direct relationship between total operating expenditures and total visits. As expenditure levels increase so do visit counts, directly and positively. In chart 1B, which looks at visits per capita, notice that this relationship completely disappears. When expenditure levels increase, per capita visits do whatever they want! There is no real correspondence between the two. (Well, at least not a linear one.)



Charts 1C and 1D graph annual measures for the five year span from 2005 to 2009 in single plots to confirm that patterns seen in charts 1A and 1B aren’t confined to just 2009. The same basic patterns are evident in all four charts.

The same patterns also appear in charts 2A through 2D for circulation and circulation per capita. To test whether these pertain only to the largest urban libraries, charts 2C and 2D contain plots for the 165 U.S. libraries with total operating expenditures of $10 million or more in 2009.2 The results are nearly the same, except that the slope of the trend line is definitely negative in the 24-library sample (chart 2B) but only slightly so in the larger sample (chart 2D). (I invite the reader to create scatter plots for medium and small libraries to see if this pattern applies to those also.)



Charts 3A through 3D show the combined data for the larger sample for 2005 through 2009. In the per capita measures the negative slopes are not as severe as in chart 2B.3 But they’re still there. So the inverse relationships that Steve Matthews wondered about were not his imagination. They are in the data.

Visits & Circ by Expend


Matthews was also interested in how Internet computers per capita worked, so these are plotted in the next graph for the larger sample using 2009 data. Finally, for curiosity’s sake, charts 5A through 5D show how visits and circulation relate to service area population, which, like expenditures, is a respectable proxy for library size. As you can see, the trend lines in charts 5B and 5D have ample negative slopes.

Public Terminals


Besides the trend lines in the charts above, it’s also possible to describe relationships among library data using statistical correlation coefficients. The bar chart below gives Pearson product-moment correlation coefficients for several variable pairs (like total operating expenditures and visits) for the sample of 24 urban libraries (lighter green) and the sample of 165 public libraries with expenditures of $10 million or more (darker green).

Expend & Population Correlations


Let’s go over the dry details. Total operating expenditures are highly positively correlated with visits, circulation, Internet computer counts, and population. These positive correlations are higher in the 165-library sample than in the 24-library sample.

On the other hand, correlation coefficients for expenditures with per capita visits and circulation, and Internet computers per 10K, hover around zero for the larger sample. Thus, expenditures have no relationships with these rate-based library statistics (no linear relationships, that is). This matches Matthews’ observation of low correspondence between these variables in the PEW data.
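For readers who want to try this on their own data, here is a minimal sketch of the Pearson calculation in plain Python. The five libraries and all their figures are made up (they are not IMLS or PEW data), and they were deliberately constructed so that visits per capita fall as library size grows. The sketch therefore shows how a total and its per capita version can correlate very differently with expenditures; it does not reproduce the coefficients in the bar chart.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up figures for five hypothetical libraries (NOT IMLS or PEW data).
population = [200_000, 500_000, 900_000, 1_500_000, 3_000_000]
expenditures = [12e6, 25e6, 40e6, 60e6, 110e6]
visits = [1_400_000, 2_500_000, 4_000_000, 5_500_000, 9_000_000]

# The per capita transformation: divide each count by its population.
visits_per_capita = [v / p for v, p in zip(visits, population)]

print(f"expenditures vs. total visits:      r = {pearson(expenditures, visits):+.2f}")
print(f"expenditures vs. visits per capita: r = {pearson(expenditures, visits_per_capita):+.2f}")
print(f"population vs. visits per capita:   r = {pearson(population, visits_per_capita):+.2f}")
```

In this toy data set total visits track expenditures closely, while visits per capita run the other way. Substituting real figures for the three lists is all it takes to check an actual sample.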

With the 24-library sample, expenditures have a moderate negative correlation with circulation and Internet computer rates, but a near zero correlation with visits per capita. All three per capita/per 10K measures are definitely negatively correlated with population in both samples. With either sample, the larger the population, the lower the per capita measures tend to be. The most obvious interpretation is that libraries serving the largest populations have a hard time matching rates of libraries serving smaller populations. It also means that if you’re a large urban library, you shouldn’t set performance goals of increasing your per capita rates by X points, unless you’re very confident about repeated years of population decline.

So there you have it. It’s a statistical thing, mostly. Increases or declines in regular library measures will barely be reflected in per capita rates.

As for Steve Matthews’ suggestion that variability in per capita library measures is an indicator of library local uniqueness, I suspect that is true by definition. Matthews’ idea sounds something like what was historically known as nominalism. Check out the story about William of Occam and nominalism in Alain Desrosières’ wonderful book, The Politics of Large Numbers (pp. 68-70).

1  I included the following public libraries to make the count an even 24 for some other charts I am working on: Cincinnati, Cleveland, Denver, Houston, Indianapolis, Las Vegas, Miami, Minneapolis, and San Diego. I really had no compelling reasons for including these particular libraries and excluding other large ones. I did try to exclude large suburban or county systems like Cuyahoga, Broward, Naperville, and the like.
2  My apologies to New York Public Library. Their humongous statistics extend the chart axes farther than I wanted. So I excluded them from this group.
3  This is a good example of why it’s almost always better to have larger samples rather than smaller. With more data the patterns and trends tend to settle out due to lowered variability. So we can have a bit more confidence in what we see.
