Fun with Numbers

After so much stuff about evaluation theory and practice in this blog, it’s time for some fun! And what better fun is there than fun with numbers?¹

Let’s begin our diversion with a graph from my prior post, shown here. Looking closely, notice how some of the gold circles lie in neat, parallel bands. These bands are more obvious in the next two charts, which ‘zoom in’ on the data by narrowing the value ranges of the vertical axes (see chart titles).

Data Source: IMLS Public Libraries in the United States Survey 2009.

When I first saw this pattern, I suspected that something had corrupted the data. Double-checking, I found the data were fine, or at least they were true to the values in the original IMLS datafile. So, I decided to resort to that popular and trusty problem-solving technique—denial.  I ignored this puzzle and moved on to something else.


Facing the Graphical Music

It’s commonly known that graphs show us things in data that we might not otherwise see. Less common is the realization that graphs can show us things we don’t want to see. This was true for the baffling pattern in these charts.

I finally confronted the problem by getting help from my brother, Rick. He’s an electrical engineer specializing in digital signal processing, which, as far as I can tell, involves sleuthing for patterns among gazillions of numbers. So, I didn’t feel guilty sending him a few hundred numbers to look at. (Actually, I think digital signal processing investigations involve taking samples from gazillions of numbers.)

I sent my brother a dataset of libraries serving community populations between 15,000 and 20,000 (from the IMLS U.S. public libraries data) along with chart 6A shown above. Let’s look at what he came up with. He started by plotting population by staff counts, seen here:

Figure 1.

Always a good idea to get the general lay of the data-land by plotting a basic graph of the key variables. He also created a closer view of Figure 1, shown here:

Figure 2.

You can see that the data line up parallel to the horizontal axis, leaving gaps between. This is because none of the libraries reported fractions of a staff person. As my brother put it, “These data are constrained to integer values, that is, what non-mathematicians call whole numbers.” Makes sense.

Then he produced Figure 3 (below), which basically matches the chart at the beginning of this post labeled “Close Up View #2”. The × 10⁻³ at the top left of Figure 3 indicates that the vertical axis values are expressed in 1/1000ths. That is, 1.0 = 0.001, 0.9 = 0.0009, 0.8 = 0.0008 and so on. Similarly, the × 10³ at the bottom right means that the horizontal axis values are expressed in thousands (15 = 15,000, 15.5 = 15,500, and so forth).

Figure 3.

Next, my brother produced this chart:

Figure 4.

And he explained:

Figure 4 shows all possible values of staff size per capita for all libraries having 6, 7, 8, or 9 staff members. For example, if any library has a staff size of 8 people, its plotted point must lie somewhere on the red line. Likewise, if any library has a staff size of 9 people, that point must lie somewhere on the top blue line. The integer-only (“quantized”) nature of the staff size and population data values manifests itself as “forbidden white bands” along the vertical axes of Figures 3 and 4.

In Figures 3 and 4 the height of the forbidden white bands, for any given population value, is simply one divided by that population value. In Figure 4 forbidden band height Δ1 is 1/16,000 = 0.00006250, and forbidden band height Δ2 is 1/19,000 = 0.00005263.

So the bands are characteristics (artifacts) of the data, produced by the fact that fractions formed from two integers can take on only certain discrete values. The areas in which values cannot occur form blank bands. The lines of data points slope downward because, for a fixed staff count, the per capita value falls as population rises. And the blank bands narrow as population increases because the height of a band at any point is the reciprocal of the population value.
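For readers who like to check the arithmetic, here is a minimal Python sketch of the banding idea. It is only an illustration, not my brother's actual method: the function names are made up for this example, and the two population values are simply the ones used for Δ1 and Δ2 in Figure 4.

```python
from fractions import Fraction

def per_capita_levels(population, staff_counts):
    """Exact staff-per-capita values available at one population size."""
    return [Fraction(n, population) for n in staff_counts]

def band_gap(population):
    """Height of the empty band between two adjacent integer staff counts."""
    return Fraction(1, population)

# Hypothetical check at the two populations discussed for Figure 4.
for pop in (16_000, 19_000):
    levels = per_capita_levels(pop, range(6, 10))  # staff counts 6, 7, 8, 9
    # Every pair of neighboring levels is separated by exactly 1/population.
    gaps = {b - a for a, b in zip(levels, levels[1:])}
    assert gaps == {band_gap(pop)}
    print(pop, [f"{float(v):.8f}" for v in levels], f"gap={float(band_gap(pop)):.8f}")
```

Using exact fractions rather than floating point keeps the gap comparison free of rounding noise. The gap comes out as 1/16,000 at a population of 16,000 and 1/19,000 at 19,000, so the white bands get thinner as population grows.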

The Density Factor

My brother’s and my graphs faithfully reflect the patterns in the data, but the artifacts are visible only when data points for a given library measure are dense enough. Data for lots of libraries must fall within fairly small ranges of both the horizontal and vertical axes. This works for staff per capita and also for public Internet computers, for which the graphs below display even more dramatic banding:


Even though almost all library per capita measures share this same characteristic (being the ratio of two integers), most of these measures have distributions that are too dispersed for the bands to be visible. The measure of library volumes held is an example of this:


In none of the above three charts of volumes data are the data points dense enough to make the bands visible.

Outer Limits

So what the heck does this all mean? There’s no profound implication to be drawn, other than to recognize there are mathematical limits to the values that library statistical data can assume. Graphically, limitations due to quantized data are mostly hidden even though these apply to nearly all per capita library statistical data.

The main lesson is that it’s important to be able to explain anomalies we encounter in data. And that statistical graphs are excellent tools for bringing anomalies to our attention. This particular one turned out to be mostly a non-issue, the sort of reassuring conclusion that can only be reached by investigating the details.

Now for more fun! Let’s move on to another situation where library statistics appear to be curiously constrained. Take a look at the next set of 3 charts, which depict the relationship between staffing counts and per capita staffing for libraries in communities with populations from 15,000 to 20,000. Chart 7 (at the top) shows the overall distribution for this subset of libraries. And the other two charts show closer views of these same data.


Notice the pattern of shorter vertical lines getting taller as staff count increases. Interesting, isn’t it? My brother created Figure 7 to shed some light on this:

Figure 7.

He explains:

In Figure 7 the bottoms (minimums) and tops (maximums) of the vertical lines are the staff size divided by the maximum and minimum population values, respectively. The height of the staff size = 9 vertical line is greater than the height of the staff size = 6 vertical line because 9 is greater than 6. The greater the staff value the greater can be the maximum of its vertical line, and the greater is the height of its vertical line. (In this figure the lines show all possible population values so there are no gaps in Figure 7’s vertical lines. The real data in Chart 7 [above] are more sparse.)
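The explanation above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions: the function name is invented for the example, and the population limits are the 15,000 to 20,000 range used for this subset of libraries.

```python
def per_capita_range(staff, pop_min, pop_max):
    """Lowest and highest staff-per-capita values for a fixed staff count.

    The minimum uses the largest population, the maximum uses the smallest.
    """
    low = staff / pop_max
    high = staff / pop_min
    return low, high

# Population range of the subset discussed in the post.
POP_MIN, POP_MAX = 15_000, 20_000

for staff in (6, 7, 8, 9):
    low, high = per_capita_range(staff, POP_MIN, POP_MAX)
    print(f"staff={staff}: {low:.6f} to {high:.6f} (height {high - low:.6f})")
```

The height of each line works out to staff × (1/15,000 − 1/20,000), so it grows in direct proportion to the staff count. That is all the "taller lines" pattern amounts to.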

Considering multiple population groupings will show us more about this pattern. As a baseline, let’s start with the following chart of 2009 IMLS library visits data. (The chart omits 70 or so extreme (outlier) values to make the overall distribution easier to view.)


Beam Me Up, Scotty

The next charts do something interesting with this prior chart’s data. They filter the data, restricting them to just libraries within particular population ranges. Each of the 3 distributions in the single chart below begins with libraries with community populations of 5,000 and then extends to 10,000, 20,000, or 30,000. Notice how this filtering (simply selecting a population range) extracts a subset from the larger distribution that, graphically, looks like a flashlight beam! With each additional extension of the population range, the beam widens. Fascinating! Such fun!!


The next chart of 6 distributions includes the 3 from the above chart (in brighter blue). The chart below uses a horizontal scale ranging to 1,200,000 visits, making more white space to the right of the first 3 distributions. The lower row in the chart adds 3 other distributions with larger population sizes, culminating with 100,000. Note how wide the swath becomes.


The chart below shows the groupings (beams) framed by the complete dataset:


An Entirely Different Question

Now, what in the Sam Hill does this mean? When we group libraries based on population and use that same measure to calculate per capita rates of other measures (like visits, volumes, circulation and so forth) the population boundaries circumscribe the values that the per capita data can take on. In all population groupings, the lower boundary of potential values moves consistently upward (on the vertical axis) as the raw statistical measure increases (on the horizontal axis). The narrower the population grouping range, the faster this minimum increases. The wider this range, the slower the minimum value rises.

At the same time, the wider a population grouping range is, the lengthier the range of potential per capita values (the vertical lines in Figure 7). So, data in wider population ranges have more “room” to vary. Since the minimum values for these wider ranges increase more slowly, these ranges create added opportunity for lower per capita values to occur. And the opposite is true for the narrower population groupings.
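To make the beam edges concrete, here is a small Python sketch. It is an assumption-laden illustration only: the function name and the sample visit counts are hypothetical, and the three groupings are the 5,000 to 10,000, 20,000, and 30,000 ranges described earlier. It computes just the geometric envelope of possible per capita values, not how real libraries behave.

```python
def beam_bounds(visits, pop_min, pop_max):
    """Lower and upper edge of possible visits-per-capita for one grouping."""
    return visits / pop_max, visits / pop_min

# Population groupings described in the post (all start at 5,000).
groupings = {
    "5,000-10,000": (5_000, 10_000),
    "5,000-20,000": (5_000, 20_000),
    "5,000-30,000": (5_000, 30_000),
}

# Hypothetical visit counts, chosen only to illustrate the edges.
for visits in (50_000, 100_000):
    for label, (pop_min, pop_max) in groupings.items():
        low, high = beam_bounds(visits, pop_min, pop_max)
        print(f"visits={visits:>7,} group={label}: {low:.2f} to {high:.2f}")
```

The lower edge rises with slope 1/pop_max, so it climbs fastest for the narrowest grouping, while the upper edge (slope 1/pop_min) is shared by all three because they all start at 5,000.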

Notice that these are effects of our filtering (grouping) process, not the data themselves. Indeed, all of this describes how per capita data might potentially behave. How the data actually do behave is an entirely different question, as the dense and sparse areas inside the beams suggest.

How, and how much, grouping libraries affects the data’s behavior is definitely relevant to the practice of comparative performance measurement and benchmarking using library statistics when per capita rates are used. And probably to surveys of libraries belonging to one or another population range. But, sorry to say, I am too funned out right now to want to explore that any further.

The two statistical conundrums (or is that conundra?) I’ve introduced here are brain-teasers, indeed! Time to let them go and enjoy ourselves! Besides, this means that an exciting sequel is bound to be on the horizon.

¹ No, this is not an April Fool’s joke. I propose this fun in all seriousness!

One thought on “Fun with Numbers”

  1. I thought more about the 2nd half of this post, which graphically depicts potential per capita values as thinner or fatter beams. I think the whole thing is really a reflection of the scales, that is, the number lines used on the graphs’ axes. By definition, larger numbers sit higher on the scales, or extend more to the right. And, in Figure 7, the lower boundaries of the vertical bands for larger population ranges are smaller than these boundaries for smaller population ranges (as they proceed to the right on the horizontal axis) because the reciprocals of larger numbers are smaller than the reciprocals of smaller numbers.

    My idea that a wider spread of a given population range allows more “room” for lower per capita values is incorrect. The real data in the narrower beams aren’t constrained. They can vary anywhere from zero (in reality) upward to infinity (theoretically). Also, it is misleading to describe the slope of the lines forming edges of the beams as moving upward “faster” or “slower” than another beam. The slope may be larger or smaller, indeed. But these numbers are not actual data. They’re only the possible range (domain) of actual data. It’s probably better to restrict the use of the term “trend” to actual data.

    The beams are interesting patterns, for sure. And they make superb graphics! But they are not particularly relevant.
