Democratizing Data

In my last post I promised to explain why the interactive charts included in the Institute of Museum and Library Services (IMLS) Find Your Library application are not very useful. At the time I didn’t realize that one of the charts reports erroneous data. So, this post will address that issue, too. Because I’ll be describing the IMLS app’s detailed workings, this post can also serve as a tutorial on the app’s somewhat mysterious features.

Look below! You’ll see a myriad of excerpts, tables, maps, annotations, and miscellany. Plus many charts needing to be deciphered, an opportunity for readers so inclined to exercise their statistical thinking skills! There’ll even be opportunities to delve into logarithmic scales, skewness, bimodal distributions, chart drop-lines, and an obscure idea known as kernel density estimation, all within the context of the IMLS app! (Yep, these are all found in the innocuous-seeming Find Your Library charts!)

As you may know, the purpose of the IMLS application is to provide access to the IMLS Public Libraries Survey data. But you may not know that the application is built on an open data/cloud computing software platform known as Socrata. Socrata is used by hundreds of customers in U.S. local, state, and federal governments. Here we see how this vendor envisions its open data solutions:


Socrata’s ideas about their software’s impact on public data. Click for larger image.

Celestial statistics! I love it! However, back on planet earth Socrata’s data presentation techniques are not what you’d call stellar. We’ll see a few of their less illuminating creations as we proceed. For now, here’s a fun software malfunction I ran across in a data table from the IMLS application:


Socrata table with garbled data. Click for larger image.

Democracy definitely means openness to assorted voices, but the table carries this idea too far.

Anyway, let’s proceed to the Find Your Library charts. These are interactive data visualizations—a map of the western hemisphere and 7 statistical charts displaying selected library measures. The visualizations are unrelated to the IMLS application’s main purpose, helping libraries find their PLS data. Rather, they are diversions intended to entice users to go on an improvisational data-digging and data-diving expedition beyond their own libraries’ data.

Ironically, the IMLS designers didn’t promote the application very well. Nor did they take the time to explain what benefits the system would provide for users. Neither did they include instructions, nor Help, nor a FAQ. As a result users are left on their own to figure out how the software works and what it does.

To see what nuggets of information the interactive charts might lead to, let’s begin with the Explore Library Systems main page (accessible via the application’s main menu):


Explore Library Systems main page scrolled down to hide title area.1 Click for larger image.

Most of the Find Your Library application menu options have a main page with a title area announcing the dataset being displayed. For space reasons, in the Explore Library Systems page above I scrolled down to omit that title. The practice of conscientiously labeling data tables, charts, and visualizations is a prime statistical graphing standard. The title area announces that the data are from the PLS 2014 Public Library Data File.

As my prior post reported, use of these data visualizations is optional, yet they occupy 90% of the screen display—the map on the right and 3 charts on the left (below the app’s search window). Let’s see how the charts work. Scrolling down below the bottom edge of the page shown above leads to the last of the 7 charts on the left and, below that, a hidden table containing all 9,305 libraries from the PLS Public Library Data File:


Table showing 11 of the 9,305 PLS libraries hidden below main page. Click for larger image.

This table’s layout matches the table with the overstruck data already shown. Although these two tables contain different data, they appear in the same location (one at a time), out of sight at the bottom of the display. Tables displayed there are limited to 11 rows which cannot be expanded. Not a particularly user-friendly design decision.

This next design decision is likely the work of the Socrata team. Notice in the table that there is a floating Feedback tab smack in the middle of the vertical scroll bar. By following multiple steps, users can move the thing out of the way temporarily.2

Someone also decided not to equip the tables with the simple text search/find capability that spreadsheets and HTML pages usually have. Users can sort the data by any column, for instance, by U.S. state, which might make the data slightly more accessible (especially if users are looking for Alaskan libraries). Besides, the Explore Library Systems and Explore Library Outlet search windows, described in depth in my prior post, are meant for locating specific libraries or outlets.

The tables don’t provide a way to extract or download PLS data. Again, the IMLS Data Catalog is meant for that purpose, except it has its own usability issues (perhaps to be discussed at some later time). In the end these 11-row tables don’t have much use. Which raises the question of why Socrata designers permitted this presentation layout in the first place.

To get to the erroneous data I mentioned we need to begin with the map itself, which I invite the reader to view again (3rd image from top of this post). The map’s cryptic heading suggests a lack of enthusiasm on the part of the designers for promoting these charts. The heading is supposed to inform users that the map can display counts of public libraries for each U.S. state, one state at a time.

Unfortunately, this information appears in a small, faint font as Number of library systems by. Below this we see the large, blue-font LOCATION— States and then medium-font Address, city, state, latitude, and longitude. This last phrase is superfluous (added noise, really) since these data items cannot be retrieved or displayed using the map.

Incidentally, the map also depicts library counts using the shading scheme in the legend at the bottom right. Map shading has been in vogue in recent years, even though it is a poor conduit for quantitative information (see this post). It would be nice if the Socrata maps displayed their data as this U.S. Department of Commerce Bureau of Economic Analysis (BEA) map does:


Source: U.S. Department of Commerce Bureau of Economic Analysis (BEA) map. Click to see BEA website image.

Of course, outright presentation of data would eliminate the need for the Socrata map’s interactivity, a dull prospect for those wanting some exciting data-digging, diving, or hacking. I wish the Socrata map legend had used discrete color categories as seen in the BEA map. Gradual shading in maps is too difficult to interpret accurately. From the shading in the IMLS map, can you tell which state has more public libraries—Wyoming or Nevada?

The Find Your Library’s map legend is an example of how Socrata’s style guide ignores recognized statistical graphing practices. In standard chart design larger (major) scale intervals are rendered in a larger font, while smaller (minor) intervals are either unlabeled or labeled in a smaller font. But here we see a single numeric interval (200) labeled in both large and small fonts. No harm done, other than the microseconds users spend evaluating and then ignoring this odd visual cue. Then again, small impediments to a positive user experience eventually add up.

Let’s consider another map:


U.S. map showing PLS outlets file data. Click for larger images.

This map is from the IMLS app’s Explore Library Outlets option available from the main menu. Display-wise this option is similar to the Explore Library Systems option (which most of the screen images in this post are from). The data for this option, however, are from the PLS 2014 Public Library Outlet Data File, which contains 17,566 outlet records describing main branches, branches, and bookmobiles.

Here I want to note that the legend in the library outlets map just shown has scale intervals different from the library systems map shown earlier. This legend’s scale intervals are unevenly spaced. Beginning at the bottom (10), they increase by 46, 260, 1,462, and finally 8,222 units. This type of spacing, known as logarithmic scaling, is sometimes used with specific types of data (more on this below). However, labeling axis intervals in even increments (divisible by 2, 5, 10, 100, 1K and so on) is friendlier for users. Even intervals are easier to interpret since no calculator is needed to evaluate them.
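For readers who want to check where those odd increments come from, here is a minimal Python sketch. It assumes the legend has five labels running from 10 to 10,000 and that they are spaced evenly on a logarithmic scale; the function name is my own, not anything from the Socrata software.

```python
def log_spaced_ticks(low, high, count):
    """Return `count` values spaced evenly on a log scale from low to high."""
    ratio = (high / low) ** (1 / (count - 1))  # constant multiplier between ticks
    return [low * ratio ** i for i in range(count)]

# Assumed legend: 5 labels from 10 to 10,000
ticks = [round(t) for t in log_spaced_ticks(10, 10_000, 5)]
steps = [b - a for a, b in zip(ticks, ticks[1:])]

print(ticks)  # tick labels, rounded to whole numbers
print(steps)  # increments between neighboring labels
```

Under those assumptions the increments come out to 46, 260, 1,462, and 8,222, which is the pattern seen in the legend: each label is about 5.6 times the one below it.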

Also, statistical graphing best practices require axes and legend scales to exceed the complete range of values in the data. But in this case extending the scale to 10K is too far. The scale should stop around 1,300 to 1,500 since the highest count in the dataset is California’s 1,159 outlets.

Now we can move to the erroneous data mentioned at the beginning of this post. For this we’ll return to the Explore Library Systems U.S. map shown again below:


IMLS application’s U.S. map with pointer hovered over Maryland. Click for larger image.

This map reveals each state’s data when the user hovers the pointing device over each state, one at a time. Unfortunately, the Total: 25 Library systems annotation is wrong. There are only 24 public libraries (library systems) in Maryland. Other state counts that the map reports are also wrong. For the reader’s reference this is an accurate list of library counts by state (which I created):


List of accurate U.S. public library counts by state. Click for larger image.

The map also has an undocumented feature that will help solve this mystery: Selecting (clicking on) a U.S. state invisibly retrieves detailed PLS data for the libraries in that state. Here’s how this works: First, when a state is selected, the system outlines that state in yellow, as Maryland is here:


Close-up of U.S. map with Maryland selected.

Then, hovering the device pointer over a state that has been selected causes an annotation to appear as seen in the image above. When this annotation appears in the map, Total and Filtered amounts will always be equal. However, in the app’s statistical charts these two amounts can occasionally vary. It would be nice if the Socrata system allowed programmers to list just the first of the two figures when desired. Less confusion and clutter for the user, especially without irrelevant instructions about clicking to clear the filter.

To see the 25 libraries in the IMLS application’s count we need to scroll down to the table hidden below the bottom of the display:


Table showing Maryland libraries. Click for larger image.

Notice at the top left the text Showing 25 Library systems out of 9,305, indicating that the system counted 25 libraries for Maryland. To examine this closer I sorted the table by the State column and then scrolled to the last row:


West Virginia library mistakenly included with Maryland libraries. Click for larger image.

The red oval highlights the public library in Piedmont, West Virginia, which the system has mistakenly assigned to Maryland, explaining the 25 count. Despite its detailed geocoding information (address, city, state, latitude and longitude listed in the map heading) the system was unable to determine Maryland’s boundaries accurately.

In the map the state of Maine shows this mistaken count:


Close-up of IMLS map showing Maine with 255 public libraries.

As seen in the list shown above, the true library count for Maine is 264. This next table (which I created) lists 11 Maine libraries that the application missed and 2 New Hampshire libraries that it mistakenly added:


Mistaken Maine count due to missing and mis-assigned states. Refer to PLS documentation for an explanation of FSCS Key column.
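The Maine arithmetic can be reconciled with a short Python sketch. The library keys below are made up for illustration (they only imitate the look of FSCS keys); the three counts, 264, 11, and 2, are the ones reported in this post.

```python
# Hypothetical keys standing in for real FSCS keys from the PLS data file
true_maine = {f"ME{n:04d}" for n in range(1, 265)}   # 264 Maine libraries
missed = {f"ME{n:04d}" for n in range(1, 12)}        # 11 the app left out
wrongly_added = {"NH9001", "NH9002"}                 # 2 New Hampshire libraries

# What the app ends up counting as "Maine"
app_maine = (true_maine - missed) | wrongly_added

def reconcile(reference, reported):
    """Return (keys missing from reported, keys that should not be there)."""
    return sorted(reference - reported), sorted(reported - reference)

missing, extra = reconcile(true_maine, app_maine)
print(len(app_maine), len(missing), len(extra))
```

So 264 minus 11 plus 2 gives the 255 that the map displays. The same set comparison, run against the real state field in the data file, would be one way to track down the other mistaken counts.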

As for identifying the remaining mistaken counts hidden in the map, I defer to the IMLS programmers.

So now we can turn to the statistical graphs located to the left of the map as shown here:


Explore Library Systems main page. Click for larger image.

First, it will be helpful to describe one of the application’s peculiar behaviors—how the windows tend to move around without notice. For this I suggest we adopt a mini-vocabulary: Let’s call each rectangular window on the page, large or small, a card. And the large area where the map card currently appears, the display space (you’ll see why). Now notice that every card has a slanted double-arrow icon located in the right corner. This symbol is an expand/collapse arrow.

So, this is how the Find Your Library pages work: The cards on the page appear in one of three formats: as a search card, a map card, or a statistical chart card. For any card appearing at the left, clicking its expand/collapse arrow moves that card to the display space at the right where it replaces whatever card was in that space. Here I clicked on the Visits card’s expand/collapse arrow:


Visits card expanded from the left column of statistical chart cards. Click for larger image.

Now the Visits card has been enlarged and moved to the display space. This chart’s taller vertical axis unflattens the curve compared to its original shape. Notice at the left that the map is cropped (with Texas prominent) and relocated to the top in the stack of cards. Next, clicking again on the expand/collapse arrow of the Visits card causes the card to move out of the display space:


Display after “collapsing” the Visits card in the display space.

For some reason, the designers arranged the cards vertically unaligned and in various sizes. Clicking the expand/collapse arrow on any card moves that card to the display space, returning the page to its original layout with the display space intact.

Now let’s look closer at the contents of the statistical charts, themselves. Note at the top of the Visits card (above) that the heading is formatted like the map’s heading. Again, it’s hard to tell which line is intended as the chart’s main title.

Socrata’s choices to ignore standard statistical graphing practices make the data difficult to see. Here the Visits chart’s vertical-axis value labels (the numeric units measured) appear at the right in a faint font. Graphs usually have vertical axes on the left. Instead, Socrata graphs rely on horizontal grid lines to mark the vertical axis, three in this case.

The most striking thing about these charts is their smooth curves, some bell-shaped like the total visits and total circulation charts. The user should beware, however, because the curves can be deceiving, a fact the IMLS designers neglected to disclose. The potential deception comes from the charts’ use of logarithmic scaling, mentioned earlier, which alters the shape of the spread of the original data.

Logarithmic scaling is typically used to fit unevenly distributed (skewed) data into a narrower chart area. U.S. public library national statistics are notoriously skewed, a characteristic the app designers should have been interested in sharing. The histograms below show this skewness for total visits and total circulation:


Histograms of total visits and circulation data. Histograms use even scaling. Click for larger images. Click here for full interactive graphs.

In both histograms the unevenness of the data is obvious from the data clustered at the left and the sparseness at the right. Also, the medians are marked there: 30,555 for visits and 37,632 for circulation, indicating that one-half of the libraries lie to the left of each median line. To see this in the app’s curves I added median lines to two:


IMLS app’s visits and circulation charts with median values added. Click to see larger image.

Visually, logarithmic scale intervals usually appear equally spaced. But the measurement units represented are not. Reading left to right on the IMLS app’s charts, each interval is 10 times larger than the previous one. Reading right to left, each interval is one-tenth the size of the interval to its right. Larger and larger intervals at the right of the axis compress a dataset’s high values inward, while leaving small values uncompressed. In these 2 curves this places the peaks and medians near the center of the curves.
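To see this compression at work, here is a small Python sketch using made-up data, not the PLS file. I assume a lognormal shape as a stand-in for skewed library counts, set its median near the 30,555 visits median reported above, and pick the spread (1.6) arbitrarily.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic stand-in for total visits: NOT the PLS data.
# The median is pinned near 30,555; the spread of 1.6 is a guess.
visits = [rng.lognormvariate(math.log(30_555), 1.6) for _ in range(9_305)]

raw_mean = statistics.mean(visits)
raw_median = statistics.median(visits)

# The same values as they would be positioned on a log10 axis
log_visits = [math.log10(v) for v in visits]
log_mean = statistics.mean(log_visits)
log_median = statistics.median(log_visits)

print(f"even scale:  mean {raw_mean:,.0f} vs median {raw_median:,.0f}")
print(f"log10 scale: mean {log_mean:.2f} vs median {log_median:.2f}")
```

On the even scale the mean sits far to the right of the median, the signature of skewed data. On the log10 scale the two nearly coincide, which is why a curve drawn on that axis can look symmetrical even though the underlying counts are not.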

Again, the underlying data for these curves are not symmetrical despite how they appear. There’s also another respect in which the curves are not quite what they appear to be. Readers may be surprised to learn that the IMLS curves communicate only 5 or 6 data points per chart despite their precise shapes. (We’ll see how the system displays the data further on.) Consequently, the IMLS designers could have just as easily imparted the same data using more readable bar charts like these:


Bar charts of total visits and total circulation report the same data as the IMLS app charts. Click for interactive charts including bar percentages.

Note that the categories in the charts’ horizontal axes match the scale intervals in the IMLS charts. Be careful, though, because logarithmic scales are continuous numeric scales while bar chart axes are categorical, with each bar representing a single, non-quantitative category. Although the bar labels happen to refer to quantitative ranges, horizontal axis scale intervals are merely categories, the same way that labels like very small, small, moderate, and large would be.
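Here is a minimal Python sketch of how counts like these can be sorted into power-of-10 categories for a bar chart. The visits figures are invented for illustration, and the function name is my own.

```python
from collections import Counter

def decade_bin(value):
    """Return the (low, high) power-of-10 range containing value (value >= 1)."""
    low = 1
    while value >= low * 10:
        low *= 10
    return low, low * 10

# Hypothetical total-visits figures for six libraries
visits = [850, 4_200, 30_555, 75_000, 250_000, 1_200_000]

counts = Counter(decade_bin(v) for v in visits)
for (low, high), n in sorted(counts.items()):
    print(f"{low:,} to {high:,}: {n}")
```

Once each library has been dropped into a range, only the handful of range counts remain. That is all a bar chart needs, and it is the same small set of numbers the smooth curves communicate.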

Here are 3 more library measures presented as bar charts with miniatures of the app’s curves underneath:


Bar charts of 3 statistical measures with Find Your Library curves shown beneath. Click chart for larger image. Click here for interactive bar charts including bar percentages.

Interesting, huh? We’ll examine the eBooks chart’s bimodal distribution further on. Just scanning these charts quickly, it’s obvious that the bar charts show the data more clearly. And make it easier to compare the relative sizes of each segment. These tasks are chores with the IMLS charts, although perhaps not to enterprising data-diggers! (In a couple of paragraphs we’ll see how user interaction is required to get the data values displayed.)

Socrata’s statistical curves do appear more, let’s say, scientific than the bar charts do. The Socrata charts are most likely calculated using kernel density estimation. Very briefly, this is a method for estimating a smooth curve from actual data points to get a smoother, more precise-looking distribution than a histogram provides. Since this method estimates the curve from the data, it will not necessarily represent them exactly.
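For the curious, this is roughly what kernel density estimation does, sketched in plain Python. I don't know which kernel or bandwidth the Socrata charts use (or whether they use this method at all), so the Gaussian kernel, the bandwidth of 0.5, and the sample values are all assumptions of mine.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a density function: the average of Gaussian bumps centered on each sample."""
    scale = 1 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return scale * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

# Hypothetical visits counts, converted to log10 units like the app's axis
visits = [850, 4_200, 12_000, 30_555, 41_000, 75_000, 250_000, 1_200_000]
log_visits = [math.log10(v) for v in visits]

density = gaussian_kde(log_visits, bandwidth=0.5)  # bandwidth is a guess

# Evaluate the curve on a grid from 10^0 to 10^9 and check the area under it
grid = [i / 10 for i in range(0, 91)]
area = sum(density(x) for x in grid) * 0.1
print(f"area under curve: {area:.3f}")
```

Each data point contributes a small bell-shaped bump, and the curve is the sum of the bumps. A wider bandwidth gives a smoother curve that strays further from the actual data, which is the trade-off behind the caution above.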

So now that we have some idea of what the curves are, we can look at the charts’ interactivity. Hovering the device pointer anywhere above a chart’s horizontal axis causes the segment of the curve directly below to be highlighted in pale white. Here’s how the highlighting looks in two renditions of the total visits chart, one for 1,000 to 10K and one for 10K to 100K:


Hovering pointer anywhere above chart axis reveals segments’ library counts. Click for larger image.

The annotations just report the number of libraries falling within each segment’s range. It would be nice to tell users a little more, for instance, the percentage of total libraries that each number represents. This next image shows the same charts which I adapted to improve their readability, as noted in the caption:


IMLS app’s total visits charts adapted to include vertical axes lines, numeric scales, and drop-lines. Click for larger image.

The horizontal lines I added to these charts are drop-lines. A drop-line matches up a single data point’s location with one of the chart’s axes to show precisely what axis value the point lines up with. The annotations visible in all four of these charts mark the midpoints of the segments. In the 2 charts just above, the drop-lines connect these midpoints to the vertical axes. There is no need for vertical drop-lines because each segment’s midpoint, with its associated library count, represents the entire segment, that is, the range 1,000 to 10K or 10K to 100K.

When the user clicks on a segment of the curve in a statistical chart, rather than just hovering over it, two parallel, dotted lines outline the segment as seen here:


Total visits chart with center segment selected. Click for larger image.

And, obviously, the segment between the parallel lines is now decked out in mustard yellow! In the dark band across the top of the page notice the phrase Showing Library systems where Visits is 10K to 100K. Selecting the segment invisibly retrieved the public libraries falling within its range, just as selecting a state in the U.S. map retrieved that state’s libraries. Scrolling down as we did with the tables earlier, we can see the resulting 11-row table:


Table showing top rows of public libraries reporting total visits from 10K to 100K. Click for larger image.

At the top left corner of this table, above the green row of column headings, notice the small text Showing 4,216 Library systems out of 9,305. This matches the annotated count on the chart. Looking again at the chart, we can get the annotation to reappear by hovering the device pointer over the selected segment:
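As a sketch of the extra detail an annotation could carry, this bit of Python turns the two figures in that text into a percentage. The function name and the wording of the label are hypothetical.

```python
def share_of_total(count, total):
    """Return count as a percentage of total."""
    return 100 * count / total

# Figures from the table header: 4,216 library systems out of 9,305
pct = share_of_total(4_216, 9_305)
label = f"4,216 library systems ({pct:.1f}% of 9,305)"
print(label)
```

That works out to about 45.3%, so nearly half of all libraries fall in the 10K to 100K visits range. A user gets that from one added number instead of doing the division themselves.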


Hovering over selected segment reveals annotation and fades segment coloring. Click for larger image.

Again, the annotation contains unnecessary information. But let’s move on to see more interactivity: By dragging the tab at the top of either vertical dotted line, the curve segment size can be extended. Take a look at these chart images:

Before and after dragging dotted line leftward one segment. Click for larger image.

The image on the left shows the chart before I dragged the tab on the left vertical line leftward from the 10K mark on the horizontal axis to the 1,000 mark. The right chart shows the result. The original 10K to 100K segment extended one axis interval leftward, making the new segment span from 1,000 to 100K. The small yellow funnel at the left of each chart’s highlighted axis label means the PLS data have been filtered to contain only libraries in the selected segment(s).

Now here’s an interesting feature of these charts: In this next image, which has the 1,000 to 100K (joint) segment still selected, the small charts at the left now have blue-filled curves superimposed on the original curves. The original curves, which had been blue, are now gray:


Selecting (filtering) segments adds 2nd blue distribution curves to smaller charts. Click for larger image.

The system plotted the blue curve to depict how libraries with total annual visits from 1,000 to 100K are distributed across total circulation segment ranges (the horizontal axes). And the same for total programs. Pretty sophisticated, huh? We can expand one of the small charts to look closer:
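The idea behind the gray and blue curves can be sketched in a few lines of Python. The seven libraries below are invented, and I'm assuming a range such as 1,000 to 100K includes its lower bound but not its upper one; the app may treat the boundaries differently.

```python
from collections import Counter

def decade_bin(value):
    """Return the (low, high) power-of-10 range containing value (value >= 1)."""
    low = 1
    while value >= low * 10:
        low *= 10
    return low, low * 10

# Hypothetical (total visits, total circulation) pairs
libraries = [
    (800, 1_500), (5_000, 7_500), (20_000, 26_000), (60_000, 120_000),
    (90_000, 95_000), (400_000, 650_000), (2_000_000, 3_500_000),
]

def circulation_counts(records, visits_range=None):
    """Count libraries per circulation range, optionally filtered by visits."""
    counts = Counter()
    for visits, circulation in records:
        if visits_range and not (visits_range[0] <= visits < visits_range[1]):
            continue  # library falls outside the selected visits segment
        counts[decade_bin(circulation)] += 1
    return counts

gray = circulation_counts(libraries)                    # all libraries
blue = circulation_counts(libraries, (1_000, 100_000))  # filtered subset

for rng in sorted(gray):
    print(f"{rng[0]:,} to {rng[1]:,}: {blue.get(rng, 0)} of {gray[rng]}")
```

The gray counts describe every library, while the blue counts describe only those passing the visits filter, tallied on the circulation ranges.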


Circulation chart expanded and moved to display space with visits 1,000 to 100K range selected in total visits chart. Click for larger image.

Expanding the total circulation chart puts it in the display space. Notice in the dark band across the top of the page the text Showing library systems where visits is 1,000 to 100K. This confirms that the PLS data have been filtered (pared down to a subset of the total 9,305 libraries in the datafile). And the visibility of the yellow segment on the visits card at the left reiterates the filter’s definition.

So, what would we learn from studying how libraries falling within a given visits range (like 1,000 to 100K) are distributed on total circulation? Well, we learn something already well understood in library statistics: visits and circulation go together. Many library statistics depend a lot on the size of the library, and besides funding, visits and circulation are at the top of this list. Pick a library in a smaller total visits range and that library is very likely to fall within a smaller total circulation range.

The same goes for libraries with medium, large, and very large visits and circulation totals. The above chart illustrates this. The blue curve spans the circulation scale’s 1,000 to 100K range with some extending higher into the 100K to 1M range, since libraries almost always report larger circulation totals than visits.

Users can also drag dotted lines in the small charts located at the left of the page. So, I extended the selected segment in the visits chart at the left to see how this affects the circulation chart:


Circulation chart with 1,000 to 1M range selected in visits chart. Click for larger image.

There’s almost no difference between the spread of the two distributions, is there? When you’ve selected libraries with visits from 1,000 to 1,000K (1 million), you’ve selected almost all of the libraries with circulation falling in that same range. In the circulation chart, the unselected libraries from 1,000 to 1 million are depicted by the tiny gray slivers visible on both sides of the curve from the 2,000 mark downward to about 1,600. (Regrettably, there are no value labels or tick-marks on the vertical axis to gauge this level by.)

This scatter plot shows this same relationship between these 2 measures:


Scatter plot indicating that total circulation and total visit counts are directly related. Click for larger interactive image.

Pretty straightforward. As visits rise, circulation rises and vice versa. The blue over gray curves in the Find Your Library statistical chart tell us this also, but it’s a more obscure message to understand. Most readers will not make the connection. If this is the message IMLS designers want to get across, scatter plots would probably work better.

Now let’s try another Find Your Library statistical chart, eBooks held by U.S. public libraries:


eBooks chart expanded and moved to display space with total visits chart 1,000 to 100K range selected. Click for larger image.

The first thing to notice is that the blue and gray curves are both bimodal distributions. Libraries are clustered around two ranges on the horizontal scale, rather than one. Looking at the underlying data in a different way (always a good strategy when trying to understand data) will help explain this. Here are the PLS eBooks data without logarithmic scaling:

Histogram of eBooks data. Click chart for larger image. Click here for interactive chart including bar percentages.

As noted at the left of the histogram, the first bin (bar) in the histogram includes the 2,100 public libraries that reported zero eBooks held. As a comparison, here I re-show just one of the set of 3 bar charts shown earlier, the chart for eBooks:


Distribution of eBooks shown as a bar chart. Click for larger image.

This chart uses a single, separate category for the 2,100 public libraries that reported zero eBooks. (2,100 represents 23% of all libraries in the 2014 PLS datafile.) As I said, since the IMLS eBooks chart includes these libraries in its first scale interval (0 to 10), the peak in the curve ends up centered at 5, the midpoint of this range. If the chart had omitted libraries reporting zero eBook holdings, the curve would have had just a single peak along the lines of the pattern in the 1K to 10K and 10K to 100K bars in this bar chart.
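Here is a minimal Python sketch of the separate-zero-category idea, along with a check of the 23% figure. The eBook counts in the sample list are made up, and the category labels are my own.

```python
from collections import Counter

def ebook_category(count):
    """Give zero its own category; bin everything else by powers of 10."""
    if count == 0:
        return "0 (none reported)"
    low = 1
    while count >= low * 10:
        low *= 10
    return f"{low:,} to {low * 10:,}"

# Hypothetical eBook holdings for eight libraries
holdings = [0, 0, 0, 3, 4_500, 8_200, 26_000, 61_000]
print(Counter(ebook_category(h) for h in holdings))

# Share of libraries reporting zero eBooks, from the figures in this post
zero_share = 100 * 2_100 / 9_305
print(f"{zero_share:.1f}%")
```

With zero split out, the first peak in the curve is exposed for what it is: a pile of libraries reporting no eBooks at all, rather than a cluster of libraries holding around 5 of them.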

Now let’s consider a scatter plot:


Scatter plot indicating no direct relationship between eBooks and Visits counts. Click for larger and interactive image.

The relationship between a library’s total visits and eBooks is quite different from that between circulation and visits (see the green scatter plot shown earlier). From this scatter plot it is clear that there is no direct relationship between eBooks and visits. Low-visits libraries (1,000 or so) can have relatively high eBook counts (40K or so). And high-visits libraries (around 1,000,000) can have few or no eBooks reported.

Incidentally, the dots that form parallel lines at about 60,000 and higher on the eBooks scale (and also around 20,000) are, for the most part, libraries reporting the same eBooks count, probably a standard total available via a state-wide eBook service. Again, another reason for the lack of relationship between library size (funding, visits, circulation) and eBook holdings: Every library in the state has the same eBooks count!

So, now you’ve seen the interactive features of the Find Your Library statistical charts. As to how informative the charts are, I’d say they head in the right direction. They do encourage users to consider taking time to view and analyze national library statistics. But I’m afraid they don’t offer options necessary for understanding the data patterns. And, alone, the charts don’t tell us very much.

There’s a limit to what we can learn by studying how libraries are distributed on various input or output measures. National library data become more understandable when examined in conjunction with key variables like library size, population, geographic area, rural, suburban, or urban settings, and so on. Here are some examples:


Library output measures by population categories. Click to see larger and interactive image.

Seeing how levels of visits are spread according to population tells us what share of national levels can be attributed to which size libraries. The same for these other library measures:


Other library measures by population categories. Click to see larger and interactive image.

One thing apparent from the 5 bar charts is this: Libraries in the largest population areas are responsible for the largest portion of usage statistics like circulation and visits. But libraries in the smaller to medium population ranges (25K to 99.9K) keep pace with larger libraries as far as availability of materials goes. The 3 categories from 25K to 99.9K each surpass 2 of the 3 top categories, and rival the top category (1M or more). The astute analyst will dig deeper by seeking out the number of libraries in each population category.

As you can see, data are not democratized by merely making them visible on a page. They are democratized by making them understandable and usable. And by providing tools for analyzing them in alternative ways.


1 The appearance and sizing of the Find Your Library app’s pages differ depending on the device and browser used. Images shown in this post are from my desktop’s 23” screen using Firefox. A few images are cropped at the bottom edges to save space.
2 One solution is clicking on Feedback, saying NO to the next option, and opening and then closing the Feedback window. That moves the Feedback button to the bottom, out of the way. At least for a while.
