Skyrocket Science

I am still on my excellence-in-graphical-data-presentation kick. Insufferably so, I am afraid. As my prior post mentioned, the principles of high-quality graphical data presentation have been articulated by William Cleveland, Edward Tufte, Howard Wainer and others. Good graphing practice is based on these three rules:

Be clear.  (Strive for clarity. – William Cleveland)
Be fair and accurate.  (Tell the truth about the data. – Edward Tufte)
Be thorough.  (You can see a lot just by looking. – Howard Wainer quoting Yogi Berra)

When I got a copy of the new study conducted by the University of Washington, Opportunity for All:  How the Public Benefits from Internet Access at U.S. Libraries, I went straight to the pictures, of course! A graph from Chapter 2 (shown below) seemed like a candidate for fine-tuning using the three rules. The graph is intended to illustrate a central theme of the study: In U.S. public libraries over the past decade (a) the delivery of public Internet access services has grown immensely, (b) utilization of public library services in general has grown healthily, though not as dramatically, and (c) overall library operating resources have increased only modestly.

But, first, I have an assignment for you. It is always helpful to gather some background information about the data used in any chart. The chart below portrays five library statistical indicators which I have charted here. (Check out charts 1A through 5. Don’t cheat. Note that data for 2007 are included in the charts.)

Source: Becker, S. et al., 2010, Opportunity for All [1]

According to its title, the U. of Washington study graph portrays change in library use (visits and circulation) and resources (terminals, hours open, and librarians) from 1998 to 2006. The chart’s layout is a bit unusual because it uses two vertical axes, each having its own unit of measure (called a scale in graphing jargon). The left axis, labeled Percent change, represents rate of change in the four statistics listed next to the symbols in the chart legend. The right axis, labeled Number of terminals, serves as a gauge for average number of public Internet terminals per library outlet, the first item listed in the chart legend.

The chart’s horizontal axis lists years. Data for 1998 are excluded since that is the base year for tracking percent change for the four statistics represented by the lines. If 1998 were inserted to the left of 1999 and the four lines were extended leftward, they would converge at zero. For the fifth statistical indicator, public Internet terminals, the 1998 value would be a count instead. So that you’ll be aware of what these baseline data really are, I list them in this box (rounded to the nearest thousand):

[Image: box listing the 1998 baseline data, rounded to the nearest thousand]

Note in the (green) box that the baseline data for public Internet terminals, hours open, and librarians are actual counts rather than averages per library outlet. The per outlet averages in the U. of Washington chart don’t really enhance our understanding of the trends. In my charts introduced above, take another look at 1A and 1B, 2A and 2B, and 3A and 3B. Note that year-to-year changes are pretty much the same whether viewing the statistical measures themselves or their per outlet averages. This is because the number of U.S. library outlets has been relatively stable, increasing by only 2% in nine years (1998 through 2007), as seen in chart 5b. Thus, the U. of Washington researchers need to graph just the rates of change of the measures alone, as they did for visits and circulation. A minor point, I know. But omitting the per outlet averages will shorten the legend labels and narrative text. And it’s one less detail that readers have to keep track of.

The chart’s most striking feature is the beige-shaded bars that resemble an uneven picket fence connected horizontally by colored wires. Besides having two different axes, the chart is a hybrid of two styles: a bar chart representing the public Internet terminals data and a line chart plotting values for the remaining data—visits, hours open, circulation, and librarians. The researchers combined these two chart styles to emphasize the public Internet terminals data. Unfortunately, this causes the rest of the data to be more difficult to decode, using William Cleveland’s term for visually interpreting a chart’s symbols, text, and arrangement.

First, the vertical bars constrain the left-to-right flow of the trend lines. By comparison, the flow of the four lines as I re-plotted them here is unfettered. Then, because the line symbols (squares, triangles, asterisks, and dots) end up enclosed in the slats of the picket fence, the bars appear as if they could be the vertical scale for the lines. The scale for the lines is the left axis (percent change). Topped off by rounded figures (5, 6, 7, 9, 9 and so on), the bars tower over the lines due to the shorter span of the right axis (14 units) compared to the left axis (40 units). This gives the false impression that values depicted by the lines for 2002 to 2006 are smaller than the numbers above the bars.

In other words, the prominence of the bars causes the right axis units to dominate the chart. Yet, that scale applies to only 20% of the values plotted in the chart, while the left axis scale applies to 80%. The design of a chart using two non-corresponding scales should ensure that data points can be judged according to the appropriate scale (axis). In this chart, data plotted for the public Internet terminals need to be toned down somehow so that the other statistics can be accurately evaluated. Annotating the legend labels to indicate which axis applies to which statistics might also help.

The basic problem, though, is plotting two types of data together that are not really comparable: rates of change for hours open, librarians, visits, and circulation and actual counts of terminals. This is (roughly) like tracking changes in the heights of children in, say, five groups. For groups #1 through #4 we record annual growth as a percent change from the prior year. But for group #5 we record actual growth in inches. A child from one of the four groups whose height increased 3% appears (mistakenly) on a par with a child in group #5 who grew 3 inches.

Mixing scales like this can lead to incorrect conclusions. The U. of Washington chart designers did carefully label the left axis with percent signs. And they differentiated the two types of data, representing terminal counts as bars and rates data as lines. But the end result is data overlapped in a way that invites readers to make inappropriate comparisons.

Let’s consider the rate calculations alone. Researchers can choose different ways to portray rates of change in data. Typically, rates are calculated periodically. For example, the rate of economic inflation is calculated annually using the prior year as a base for the current year. This is called an annual rate. The U. of Washington report happens to use a cumulative rate, which measures change from a fixed base year across a range of periods (from 1998 to 2006 in this case). Because annual changes compound, a cumulative rate is not simply the sum of the annual rates. It would be helpful, then, for the left axis to read cumulative percent change to remind readers that this type of rate is being used.
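To make the distinction concrete, here is a minimal sketch of the two calculations in Python. The function names and the yearly counts are my own hypothetical illustrations, not data or methods from the study. I am assuming the cumulative change is measured from a fixed base year, as the chart's 1998 baseline suggests.

```python
def annual_rates(values):
    """Percent change of each value from the value one period earlier."""
    return [100.0 * (cur - prev) / prev for prev, cur in zip(values, values[1:])]


def cumulative_rates(values):
    """Percent change of each later value from the first (base-year) value."""
    base = values[0]
    return [100.0 * (v - base) / base for v in values[1:]]


# Hypothetical yearly counts, base year first (not figures from the study).
counts = [100, 110, 121]

print(annual_rates(counts))      # change from the prior year
print(cumulative_rates(counts))  # change from the base year
```

With these made-up counts, each annual rate is 10% while the cumulative rate after two years is 21%, so the same series looks quite different depending on which rate gets charted.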

Demonstrating cumulative progress in rolling out public Internet terminals in U.S. public libraries is important, of course. Nevertheless, comparing productivity rates for the startup phases of a project with later phases introduces certain problems. The report states that the study chart shows the average number of Internet terminals to have grown by more than 300 percent from 1998 through 2006. [2] The chart, though, doesn’t actually contain Internet terminals growth rate data. In any case, the narrative declares that Internet access services in U.S. libraries have skyrocketed, a characterization based, at least in part, on this extraordinary percentage. [3]

1998 was the first year that statistics about public Internet terminals were collected by the National Center for Education Statistics. That year about 57% of libraries submitting their annual data reported this new statistical item. So, the 1998 count (24,088 for the 50 U.S. states and the District of Columbia) is quite understated. In 1999, when 96% of reporting libraries did report this item, the count of public Internet terminals jumped to 69,427, a 188% increase. From 1999 through 2007 the cumulative rate of growth in public Internet terminals was 122%. Take a look at the trajectories traced by the yellow-star and green-star lines in the chart below. The propellant for the 300%-plus figure at the far right of the yellow-star line turns out to be under-reported national data in 1998. Clearly, the green-star line is a fairer representation of cumulative rate of growth in Internet terminals over the decade.
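The 188% jump can be checked from the two counts quoted above. This is only a quick arithmetic sketch; the helper name percent_change is my own.

```python
def percent_change(old, new):
    """Percent change from old to new, relative to old."""
    return 100.0 * (new - old) / old


terminals_1998 = 24_088  # count quoted above, reported by about 57% of libraries
terminals_1999 = 69_427  # count quoted above, reported by 96% of libraries

print(round(percent_change(terminals_1998, terminals_1999)))  # 188
```

Since the 1998 count is the denominator, any under-reporting that year inflates every rate measured from it.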

Cumulative growth in public terminals for base year 1998 (yellow) versus 1999 (green).  Click chart for larger image including chart legend.
Data source: IMLS Public Libraries in the United States Survey

Choosing a less biased startup year still is not the best way to gauge progress of a large project like the provision of Internet access in public libraries nationwide. A new project of this sort will always have high growth rates in its earliest phases. As the project proceeds, these high rates cannot be maintained. The chart below compares annual rates of change (not cumulative rates) for the four statistics used in the U. of Washington study chart. Note that these four vary between -2.2% and +6%.

Annual rates of change in selected resource and use measures.  Click chart for larger image including chart legend.
Data source: IMLS Public Libraries in the United States Survey

This next chart shows the same data (flattened due to the extended left axis scale) along with annual rate of change for the terminals data. This rate peaked at 188% in 1999 and decreased precipitously to about 6% by 2007.

Annual rates of change in public Internet terminal counts and other library measures.  Click chart for larger image including chart legend.
Data source: IMLS Public Libraries in the United States Survey

A better measure for evaluating progress in public Internet terminal installations is terminal counts, since these indicate actual levels of work accomplished. This is one reason the terminals data on the study chart are more easily interpreted than the rates data. Another is that the data units (terminal counts) give us a more immediate sense of the quantities involved. Evaluating the rates data requires looking up baseline data (shown in the green box above) and considering the different units and orders of magnitude. (How should we compare 6% of 190,000 terminals with 6% of two billion circulation transactions?)
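To put rough numbers on that parenthetical question, here is a small sketch that converts a rate back into the quantity it represents. The helper name is hypothetical, and the bases are the round figures mentioned above rather than exact survey values.

```python
def absolute_change(base, percent):
    """Quantity represented by a percent change applied to a base value."""
    return base * percent / 100


print(absolute_change(190_000, 6))        # terminals: 11400.0
print(absolute_change(2_000_000_000, 6))  # circulation transactions: 120000000.0
```

The same 6% stands for about eleven thousand terminals in one case and 120 million transactions in the other, which is why the rates cannot be read without their baselines.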

Besides examining average number of terminals, looking at total counts, ranges, percentiles, and other descriptive statistics increases our understanding of the data. The distributions of terminal counts by year are illustrated in statistical graphs called boxplots, which appear in charts 8A and 8B. Chart 8B contains a table showing total counts, minimums, maximums, quartiles, and so on. From chart 8A we see that some libraries reported very high counts that cause the annual averages to be artificially high, whether calculated per library or per library outlet. Medians would be more accurate summary measures.
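A tiny illustration of why a few very high counts distort the average but not the median. The terminal counts below are invented for the example and are not taken from the survey data.

```python
from statistics import mean, median

# Hypothetical terminal counts for six libraries; one reports a very high count.
hypothetical_counts = [4, 5, 6, 7, 8, 250]

average_count = mean(hypothetical_counts)
median_count = median(hypothetical_counts)

print(round(average_count, 2))  # pulled far upward by the single large value
print(median_count)             # stays near what a typical library reports
```

Five of the six made-up libraries have eight terminals or fewer, yet the average comes out above 46. The median of 6.5 describes the typical library much better.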

As for examining trends in library resource utilization, expenditure data are more telling indicators than counts of librarians or library hours. For the sake of fairness, multi-year expenditures need to be adjusted for inflation. Chart 9A below gives total operating, staffing, and collections expenditures converted into 2007 dollars (left axis). [4] When adjusted for inflation, both total operating and staffing expenditures by U.S. public libraries have increased moderately from 1998 to 2007, while collections expenditures increased slightly. Inflation-adjusted operating expenditures increased at an average annual rate of 3.1%, staffing expenditures at 3.4%, and collection expenditures at 1.4%. Of the three expenditure types, only collections involved actual spending cuts (negative growth) during the time period as seen in Chart 9B below.
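For readers curious about the mechanics, converting to constant dollars with a price index generally looks like the sketch below. I am assuming a simple ratio of index values. The function name and the index numbers are hypothetical and are not the actual GDP deflator figures used for my charts.

```python
def to_constant_dollars(nominal, deflator_year, deflator_target):
    """Restate a nominal amount in target-year dollars using price index values."""
    return nominal * deflator_target / deflator_year


# Hypothetical index values: 80.0 in the spending year, 100.0 in the target year.
print(to_constant_dollars(100.0, 80.0, 100.0))  # 125.0
```

Rates of change computed after this conversion reflect changes in purchasing power rather than changes in price levels.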

Trends in library inflation-adjusted expenditures.  Click chart for larger image including chart legend.
Data source: IMLS Public Libraries in the United States Survey

Trends in rate of change in library inflation-adjusted expenditures.  Click chart for larger image including chart legend.
Data source: IMLS Public Libraries in the United States Survey

Note in the upper chart above that inflation-adjusted collection expenditures were nearly flat while operating and staffing expenditures increased moderately. In the lower chart, annual growth in inflation-adjusted operating, staffing, and collection expenditures can be seen to have declined from 2002 to 2004. Only collection expenditures were cut those years.

Longer-term trends in public library expenditures, income, and budgets have been examined by Bob Molyneux in his article Squeeze Play: Public Library Circulation and Budget Trends, FY1992-FY2004 in Public Library Quarterly, 26:3/4, March 2008.

 
—————————

1  Becker, S., Crandall, M. D., Fisher, K. E. et al. 2010. Opportunity for All: How the American Public Benefits from Internet Access at U.S. Libraries. (IMLS-2010-RES-01) Washington, D.C.: Institute of Museum and Library Services. p. 18. The study was funded jointly by IMLS and the Bill & Melinda Gates Foundation.
2  Becker, S. et al. p. 17.
3  Becker, S. et al. p. 18.
4  2007 dollar values were calculated using the GDP Deflator method from www.measuringworth.com.
