E Pluribus Unum

This simple statement is one of several “myths” appearing on GeekTheLibrary:

The busier the library, the more money it receives.

GeekTheLibrary is concerned that the general public mistakenly believes libraries are funded based on how much they are used by patrons. Perhaps that concern is well founded; I don’t know. But the statement also happens to be a great jumping-off point for discussing ways to look at library data.

As I described in my prior post, scatterplots are graphical tools for exploring relationships between two characteristics of single things, such as heights and weights of children or educational levels and reading habits of library non-users. Inspired by the GeekTheLibrary statement, we can use these tools to help answer this question: What is the relationship between U.S. public library visits and expenditures?

Chart 1 provides a preliminary answer to the question. The chart presents IMLS U.S. public library data for libraries reporting these two statistical items for the operational year 2008. Basically, it appears that visits and library expenditures tend to increase and decrease together. More highly funded libraries have higher visit counts, and less funded libraries have fewer. Beyond this, the chart isn’t very informative because most of the circles (each one representing a single library) are bunched up in the graph’s left corner. This is due to the chart axes extending to very high numbers to accommodate the largest libraries.


Separating the libraries into smaller groups based on expenditures will help reveal more of the data. Chart 2A shows only libraries with less than $50,000 in total annual expenditures. “N=1,944” in the chart title indicates how many libraries are included. Even with this smaller group of libraries, Chart 2A requires fairly high axis values, up to 60,000 for the visits axis. To focus on the majority of libraries in this group, we can omit the high values from the axis. All we need is a cutoff value for visit counts. The vertical dotted line in Chart 2A suggests one (see note 1).
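For readers who want to try the cutoff on their own data, here is a minimal sketch of the rule given in note 1 (average visits plus 2 standard deviations, rounded to the nearest 1,000). The function name and the visit counts are made up for illustration and are not IMLS figures. The note does not say whether a sample or population standard deviation was used, so the sample version here is an assumption.

```python
import statistics

def visits_cutoff(visits, k=2, round_to=1000):
    """Return mean + k standard deviations, rounded to the nearest round_to."""
    mean = statistics.mean(visits)
    # Assumption: sample standard deviation; the post does not specify which.
    sd = statistics.stdev(visits)
    return round((mean + k * sd) / round_to) * round_to

# Hypothetical visit counts for one expenditure group (not real library data).
visits = [1200, 3400, 5100, 800, 9700, 15000, 2300, 60000]

cutoff = visits_cutoff(visits)
kept = [v for v in visits if v <= cutoff]  # libraries left of the dotted line
print(cutoff, len(kept), "of", len(visits), "libraries kept")
```

The filtered list plays the role of the second chart in each pair: the same group, minus the few libraries whose visit counts stretch the axis.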


Chart 2B shows only libraries that fall to the left of the vertical dotted line in Chart 2A, that is, libraries in this expenditure group reporting 16,000 or fewer visits. You can see how a shorter visits axis spreads the circles out. (Notice now that N = 1,867.) Still, a large contingent of libraries, those with expenditures of $15,000 or less, is clustered in the left corner. We could separate this group further, but let’s leave them as they are for now.


Without resorting to any official statistical formulas, what does the arrangement of the circles in Chart 2B tell us? Mainly, it demonstrates that individual libraries vary a lot, relatively speaking. There are libraries funded at nearly $50,000 with quite small visit counts, and others with large visit counts. A few libraries funded at $5,000 or less have visit counts of almost 10,000. One library with about $2,000 in expenditures even reported 10,000 visits! You could say that the relationship between expenditures and visits is “all over the map!” But the libraries still are not evenly spread. For instance, there are many more higher-funded libraries ($30,000 to $50,000) reporting lower visit counts (6,000 or less) than lower-funded libraries ($10,000 or less) reporting higher visit counts (6,000 or more).

Speaking of official statistical formulas, the angled line (a regression line) in Chart 2B comes from exactly that. For our purposes we can ignore the intricacies of these formulas. We just need to know that a regression line is intended as a summary of the relationship between two measures, like visits and expenditures. The line leads us to the same conclusion suggested for Chart 1: For the libraries depicted in Chart 2B, as visits increase or decrease, so do expenditures, and vice versa. That is, as long as you’re willing to go with a blanket statement about the data.
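For the curious, the formula behind a line like this can be sketched in a few lines. The version below is ordinary least squares, which is the usual way such a line is drawn; the post does not say which method the charting software used, so treat that as an assumption. Visits go on the horizontal axis to match the charts, and the five data points are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line through (xs, ys): returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical libraries (not IMLS data): annual visits and expenditures.
visits = [2000, 4000, 6000, 8000, 10000]
expenditures = [12000, 30000, 22000, 45000, 38000]

slope, intercept = fit_line(visits, expenditures)

# Residuals: how far each library sits above or below the line.
residuals = [y - (intercept + slope * x) for x, y in zip(visits, expenditures)]
print(slope, intercept)
print(residuals)
```

In this made-up example the slope is positive, yet every library misses the line by thousands of dollars. The line summarizes the group without matching any one member.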

Charts 3A and 3B depict libraries with expenditures from $50,000 to $99,999. Here we see the same basic patterns as in the prior charts, although the data in Chart 3B are more evenly spread out. A regression line is also visible in that chart. This, again, serves to draw one conclusion from many details!


Notice in Chart 3B and also in Chart 2B that hardly any circles fall exactly on the regression lines. In fact, nearly all of the circles stray away from the lines, some quite far. While the lines represent (or re-present) the data as a whole, they don’t really express the plurality of the data.

Let’s look at a different example in Chart 3C, a histogram showing how libraries in the $50,000 to $99,999 expenditure group fall into bins (or buckets) sized at 1,000 visits each. Note the vertical arrow indicating 12,792 as the median number of visits for the group (see note 2). The purpose of a median (and of an average) is to choose one typical value to characterize the group as a unit. The 64 libraries that fall into the 12,000 to 12,999 group are close to the median. Otherwise, quite a bit of the data is distant from it. But this is true by definition since medians and averages point to the middle of a group. And the middle is the location the data gather around, so to speak.
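The binning and the median are easy to reproduce. The sketch below counts libraries per 1,000-visit bin and compares the median with the average. The ten visit counts are hypothetical, chosen only to mimic a right-skewed group, and are not taken from the IMLS file.

```python
import statistics
from collections import Counter

def bin_counts(values, width=1000):
    """Count values per bin; each key is the bin's lower edge."""
    return Counter((v // width) * width for v in values)

# Hypothetical visit counts for one expenditure group (not real library data).
visits = [8200, 9500, 11900, 12100, 12700, 12800, 13400, 15900, 21000, 41000]

counts = bin_counts(visits)
median = statistics.median(visits)
mean = statistics.mean(visits)

for lower in sorted(counts):
    print(f"{lower:>6} to {lower + 999:>6}: {counts[lower]}")
print("median:", median, "mean:", mean)
```

As in the histogram, a couple of unusually busy libraries pull the average well above the median, while most libraries land in bins on either side of it.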


Statistical formulations like these distill information about an entire group, but few group members actually conform with the exact values produced. The formulas end up describing the archetypal group member. This idea applies also to the regression lines in the charts shown here and the formulas that produced them. Formal statistics like these epitomize the group in the aggregate by underplaying the idiosyncrasies of its members.

Take a quick look at the data for the remaining expenditure groups in the multiple charts below. The same patterns are evident: Data points are scattered around quite a bit, but the regression lines all slope upward. The $10 million and over group appears less dispersed, with data points seeming to hover near the line. This is primarily due to the smaller number of group members and large magnitudes (millions of dollars and visits) involved.
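The same exercise can be repeated group by group. The following sketch sorts libraries into expenditure bands and fits a separate least-squares line to each one. Only the first two band edges ($50,000 and $100,000) come from the groups named in this post; the remaining structure, the helper names, and all nine data points are assumptions made for illustration.

```python
from bisect import bisect_right
from collections import defaultdict

# Upper edges of the first two expenditure bands used in this post.
# Anything at or above the last edge falls into one catch-all band (an assumption).
BAND_EDGES = [50_000, 100_000]

def band_index(expenditure):
    """0 for under $50,000, 1 for $50,000 to $99,999, 2 for everything above."""
    return bisect_right(BAND_EDGES, expenditure)

def ols_slope(xs, ys):
    """Slope of the ordinary least-squares line through (xs, ys)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical (visits, expenditures) pairs, not real library data.
libraries = [
    (3000, 10_000), (9000, 40_000), (6000, 20_000),
    (8000, 60_000), (20_000, 95_000), (14_000, 70_000),
    (30_000, 150_000), (90_000, 400_000), (50_000, 300_000),
]

groups = defaultdict(list)
for visits, spend in libraries:
    groups[band_index(spend)].append((visits, spend))

slopes = {
    band: ols_slope([v for v, _ in rows], [s for _, s in rows])
    for band, rows in sorted(groups.items())
}
print(slopes)
```

With these invented numbers each band has its own slope, and all of them are positive, which is the pattern described for the charts.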


At their root, statistical calculations like medians, averages, standard deviations, regression lines, correlation coefficients, and others are generalizations. Like well-crafted article abstracts, the calculations can portray something essential about the data. Yet, they conceal the finer points of the story.

So, what is the answer to our original question? Are library funding levels related to how busy libraries are? It appears they are. Increases or decreases in one correspond with increases or decreases in the other, as long as we consider public libraries together as a group. But there are many (idiosyncratic) exceptions.

Plus, the degree of this correspondence, represented by the angle (slope) of the regression lines, depends on the size of the library. We could try to construct a single chart for all U.S. public libraries, linking the individual lines from the charts above as segments in a larger line, left to right. (The line would probably be curved, I think.) This could give us a wonderfully accurate description of this super-group. But no matter how precise we attempt to be, we are still talking in the abstract. Discussing ideals and archetypes. Our super-line would describe all libraries in general, but no libraries in particular.


1  We can devise a cutoff value in any way that is reasonable. I decided to use a standard statistical tactic: For each expenditure group I took the average visit count, added 2 times the group’s standard deviation for visit counts, and rounded to the nearest 1,000.
2  The median is a better number to use to describe the center of this set of data due to the data’s unevenness (skewness). Exceptionally high visit counts visible at the right of the histogram push the average up to 14,685.
