Data Detour

Nowadays libraries aspire to be data-driven. Almost everyone agrees that collecting and using data to improve organizational performance is a good thing. Implied in the various regimens promoting this idea (library assessment, managing-for-results, evidence-based practice, quality management, etc.) is the need for practicing two virtues: patience and determination. These virtues happen also to be part and parcel of information literacy. We repeatedly advise users to avoid settling for the most convenient and quickly accessible information that shows up, and urge them to put ample time and energy into thinking critically about their question, its context, and the complete range of potentially relevant information.

Being data-driven requires this same discipline. I mention discipline because this blog entry concerns a challenging but quite educational quantitative topic. If you follow this to its conclusion I guarantee your numeracy muscles will be invigoratingly exercised!

I begin with this caveat: When advocacy or public relations professionals fail to think critically about their data, they risk communicating the wrong message to their audiences. Last month I ran across an interesting case of this in a radio public service announcement by the Ohio Highway Patrol. The radio ad promoted automobile seat belt use based on this statistic: Of all traffic fatalities, 62% of victims were not wearing seat belts.1

This statistic surprised me because it didn’t make seat belts seem like that great an advantage, especially when you consider the common notion of the 50/50 chance that people apply in everyday circumstances. Very often we figure the chance of some future outcome as, “Either it will happen or it won’t.” Radio listeners with this same mindset are liable to think, “What’s the use of wearing seat belts if I still have a 38% chance of being killed?” They won’t realize that both they and the Ohio Highway Patrol are calculating the risks incorrectly.

Risk calculation, i.e. probability, is the challenging quantitative topic I want to discuss here. I use this particular topic to illustrate a more general idea: When we are fair-minded and thorough in our analyses, the data actually do a good amount of the driving for us. They definitely steer things in certain directions, towards certain conclusions or away from others.

In the radio ad example the key question is, “By how much do seat belts lower the risk that vehicle occupants will be killed in traffic accidents?” The 62% and 38% figures don’t answer this. Instead, they answer the question, “When fatal accidents occur, how likely is it that victims were or were not wearing seat belts?”

The percentages in the ad are a special type of calculated chance known as a conditional probability. Here’s how it works: Given the fact (condition) that an individual was killed in an accident—something we know with 100% certainty—the likelihood that that individual was not wearing a seat belt is 62%. This statement is more understandable rephrased as: Among all traffic fatalities, 62% of victims were not wearing seat belts. Our question, though, concerns the risks that exist before highway travelers suffer the misfortune of being accident victims. Addressing this question leads to some convoluted quantitative ideas. So this is where patience and determination come in.

A classic example of conditional probability discussed in statistics courses will help illuminate the highway fatality statistics. This example, described in psychologist Gerd Gigerenzer’s book, Calculated Risks, involves a common misconception about the reliability of medical laboratory tests.2 When a medical test is, say, 95% reliable and a given patient’s test is positive, the result is commonly interpreted to mean that the patient has a 95% chance of having the disease. This is wrong.

Like the highway accident statistics, the 95% here is a conditional probability. Among the entire population of people who actually—and with 100% certainty—have a given disease, a particular medical test will detect 95% of the cases and will miss 5%. Note that these percentages don’t tell us the likelihood that a given patient actually belongs to the diseased population.3

Think of it this way: Assume that two medical tests are both 95% reliable. One is designed to detect an extremely rare tropical disease, and the other any of several common cold viruses. You realize intuitively that the 95% cannot possibly define the risk of having either of these diseases. On any given day many more patients infected with cold viruses are likely to seek medical care than patients with rare tropical diseases. So, somehow the prevalence of a disease in the larger population must be part of any calculation of the chances that a positive laboratory test is true.
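Here’s a minimal Python sketch of that intuition, using Bayes’ theorem. The 95% sensitivity comes from the example above; the 5% false-positive rate and the two prevalence figures are assumptions chosen purely for illustration (footnote 3 explains that the post otherwise sets false positives aside):

```python
def prob_disease_given_positive(prevalence, sensitivity=0.95, false_pos_rate=0.05):
    """Bayes' theorem: P(disease | positive test)."""
    true_positives = sensitivity * prevalence
    false_positives = false_pos_rate * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

# Rare tropical disease: assume 1 case per 100,000 people
print(prob_disease_given_positive(1 / 100_000))  # ~0.0002, i.e. a 0.02% chance
# Common cold viruses: assume 20% of patients seeking care are infected
print(prob_disease_given_positive(0.20))         # ~0.83, i.e. an 83% chance
```

The same “95% reliable” test yields wildly different chances of actually having the disease once prevalence enters the calculation.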

The same goes for the radio ad data. They do not tell us whether a given vehicle occupant will, or will not, fall into the group of individuals involved in fatal highway accidents in the first place. This is where US highway statistics are useful, such as some I found for 2010.4 In rounded figures, in the US in 2010 there were 33,000 highway fatalities among 214 million total drivers. Thus, the chances of being in a fatal vehicle accident were 33,000 / 214,000,000, which is 0.000154, or about 15 deaths for every 100,000 drivers.5 (Since I only found data about vehicle drivers, I’m treating all vehicle occupants as if they were drivers. While this will overstate the true risks, it will allow us to step through the calculations.)
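Here’s that arithmetic as a quick Python sketch, using the rounded figures above:

```python
# Overall 2010 risk of being in a fatal accident (rounded figures from the text)
fatalities = 33_000
drivers = 214_000_000

p_fatal = fatalities / drivers
print(p_fatal)            # 0.000154...
print(p_fatal * 100_000)  # about 15.4 deaths per 100,000 drivers
```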

Next, for guidance on how to combine the chances calculated in the prior paragraph with the conditional probabilities in the radio ad, we can rely on something known as Bayes’ theorem. This theorem points us to two other related probabilities: Among all US drivers in 2010, how frequent were (1) accidents where vehicle occupants were killed while wearing seat belts; and (2) accidents where vehicle occupants were killed while not wearing seat belts? These sound exactly like the 62% and 38% figures, don’t they? But they are not, because they are derived from the complete pool of US drivers in 2010, not just from the 33,000 fatal accidents.

The two probabilities just described should answer our original question. Following Bayes’ theorem, we calculate them by multiplying the radio ad percentages (62% and 38%) by the overall prevalence of fatal accidents, that is, by about 15 deaths per 100,000 drivers. Here are the results:

(a)  9.5 fatalities per 100,000 US drivers where seat belts were not worn
(b)  5.8 fatalities per 100,000 US drivers where seat belts were worn

These are the risks of having been a driver in a fatal accident without and with seat belts in the US in 2010. (Again, to be accurate we would need the count for all vehicle occupants rather than just drivers.) I follow Gerd Gigerenzer’s advice to present these as fatalities per 100,000 drivers—what he calls natural frequencies—because these are more self-explanatory than probabilities (0.0000954 and 0.0000584) or percents (0.00954% and 0.00584%). Clearly, the chances of being in a fatal accident in the US in 2010 with or without seat belts were quite small—about 15 in 100,000. And the unique risks associated with having worn or not worn seat belts appear in (a) and (b) above.
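Continuing the sketch above, the joint risks in (a) and (b) are just the radio ad’s conditional percentages multiplied by the overall prevalence (here I use the exact fatality count from footnote 4, which is why the results match 9.5 and 5.8):

```python
# Bayes' theorem: P(killed and belt status) = P(belt status | killed) * P(killed)
fatalities = 32_885  # exact 2010 count (footnote 4)
drivers = 214_000_000
p_killed = fatalities / drivers

per_100k = 100_000
print(0.62 * p_killed * per_100k)  # ~9.5 fatalities per 100,000 drivers, unbelted (a)
print(0.38 * p_killed * per_100k)  # ~5.8 fatalities per 100,000 drivers, belted (b)
```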

Unfortunately, there is a glaring problem here. The risks associated with wearing seat belts or not are both lower than the overall risk, 15 in 100,000. For instance, the risk associated with not wearing a seat belt is only 9.5 in 100,000. Results like these would be hard to explain to the general public, even though they are technically correct.6 The main problem is that the original radio ad made the wrong comparison. Even with the probabilities correctly stated, comparing seat belt usage among victims of fatal traffic accidents is (pardon the term) a quantitative dead end.

There’s an alternative comparison that makes a stronger case for wearing seat belts. First, for argument’s sake, we need to assume that seat belts do nothing to prevent traffic fatalities—i.e., they make no difference in the outcomes of accidents. Then we incorporate the following additional information from the larger population of all US drivers: According to recent estimates 85% of drivers consistently wear seat belts and 15% do not.7  If the assumption that seat belts make no difference were true, we’d expect that the group of 2010 traffic fatalities would consist of 85% seat belt wearers and 15% non-wearers.

This (hypothetical) scenario is illustrated in the chart below. Numbers above the left set of bars, labeled Expected Fatality Counts If Seat Belt Usage Made No Difference, are the 33,000 fatal accidents multiplied by 85% and 15%, respectively. Numbers above the right set of bars, labeled Actual Fatality Counts, are 33,000 multiplied by the radio ad percentages, 38% and 62%.

[Chart: Expected versus actual fatality counts by seat belt usage]

Now let’s compare these numbers, beginning with the bars at the left of the chart. Assuming that seat belts made no difference in the outcome of accidents, we’d expect fatalities among wearers to be 23,000 more numerous than among non-wearers (28,050 versus 4,950). However, in the actual counts on the right side we see that seat belt non-wearers outnumbered wearers by roughly 8,000 (20,460 versus 12,540). We can also compare expected versus actual counts for each seat belt group: for seat belt wearers the actual count is about 15,000 less than the expected count, and for non-wearers the actual count is about 15,000 higher than the expected count.
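For anyone who wants to verify the chart, here is its arithmetic as a short Python sketch; the ratios in the last column quantify the over- and under-representation discussed next:

```python
# Expected counts assume (hypothetically) that seat belts make no difference,
# so fatalities would mirror the 85%/15% usage rates from footnote 7.
total_fatalities = 33_000

expected = {"belted": total_fatalities * 0.85, "unbelted": total_fatalities * 0.15}
actual = {"belted": total_fatalities * 0.38, "unbelted": total_fatalities * 0.62}

for group in ("belted", "unbelted"):
    ratio = actual[group] / expected[group]
    print(group, round(expected[group]), round(actual[group]), round(ratio, 2))
# belted   28050 12540 0.45  -> under-represented by more than half
# unbelted  4950 20460 4.13  -> over-represented more than four-fold
```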

Granted, the Ohio Highway Patrol radio ad did present the same basic message: Seat belt wearers are less frequent in fatal traffic accidents. Or, if you like, non-wearers are more frequent. But the ad could not really say how under- or over-represented either group was. This omission caused the ad to understate the effect of wearing seat belts. On the other hand, comparisons from the above chart amplify the intended message without misrepresenting any risks.

I leave the interesting task of adapting the information from the chart for the general public to the experts in marketing communications and public relations. Somehow they’ll need to deal with the facts that (a) the comparisons made in the chart are not estimates of risk and (b) the comparisons do not tell us anything about the number of lives saved due to seat belt use.

In the meantime I’d like to conclude with a statistical riddle for the reader’s entertainment. The solution to the riddle demonstrates why the true effectiveness of seat belts in preventing traffic fatalities cannot be measured using traffic fatality data. Here’s the riddle:8

In 1835 physician H. C. Lombard surveyed death certificates over a half-century in Geneva, Switzerland to compare the life expectancies of various professions. He found that the majority of professions had a mean life expectancy between 50 and 60 years. The most dangerous (i.e., short-lived) profession turned out to be the profession ‘students’. That group’s mean life expectancy was 20.7 years. What explains this curious result?

—————————
1   I believe the statistics were for the US rather than Ohio only. According to the radio ad link above the data were from 2011.
2   Oxford University statistician Peter Donnelly also addressed this misconception in this TED lecture, at the 11:00 minute mark. So did Charles Seife in the appendix to his book, Proofiness. Because this misconception shows up in court cases, it has become known as the prosecutor’s fallacy. See especially the disturbing case of Sally Clark, who was imprisoned for the murder of her two children who died from sudden infant death syndrome, described in Lawrence J. Hubert & Howard Wainer, A Statistical Guide for the Ethically Perplexed, 2013, 19-20, 49.
3   This example has an extra complication that, fortunately, we can ignore here: The laboratory tests also produce false positive results where the disease is detected in people who do not actually have it. These false positives also need to be taken into account when calculating the chances of actually having the disease or not when a test result is positive.
4   Total highway fatalities in 2010 were 32,885. See Traffic Safety Facts 2010.
5   If the data were available, it would be nice to count the number of highway trips rather than drivers.
6   Due to the mathematical rules that probability follows, the risk for a subgroup can never exceed the risk for the larger group it belongs to: a joint probability (e.g., killed and unbelted) is always less than or equal to its component probability (killed).
7   See Traffic Safety Facts Research Note: Seat Belt Use in 2010-Overall Results, National Highway Traffic Safety Administration, Sept. 2010 (DOT HS 811 378).
8   I found this story in Howard Wainer, Graphic Discovery: A Trout in the Milk and Other Visual Adventures, 2005, 143-144.

7 thoughts on “Data Detour”

  1. Interesting article… I once read somewhere that smokers are statistically less likely to die from age-related illness. Hmm.
    Students as an occupation would be over-represented by younger people; other careers may be over-represented by older people…

  2. Thanks Catherine. What you heard makes sense: since smokers generally die earlier, over time there are fewer of them left to die from age-related diseases later. As far as the riddle goes, representation (i.e., sampling) is the key thing. I won’t say any more now, but you’re on the right track, though it’s not so much that young people are over-represented among students…

    1. Hmm, I understand that sampling needs to be done in a way that represents the larger population… so is it to do with the professions: what proportion of the professions are students, what proportion architects, etc.?

  3. Your persistence forces me to answer, Catherine. (I enjoy it!) Attempting to assess the longevity of people from various professions by looking at only those who are already deceased relies on a biased (i.e., non-random, non-representative) sample. This sample is relevant to the question, “Among deceased persons in Switzerland that year, what profession has the lowest average age of death?” But the research question is, “Among all professions in Switzerland, which profession has the lowest average life expectancy?” Answering this requires surveying a representative sample, tracking them longitudinally (over perhaps scores of years), and recording age of death and profession for each survey subject as these occur.

    The problem with taking a single ‘snapshot’, i.e., counting just death certificates at one point in time, is that professionals currently alive are omitted from the sample. Their longevity cannot be accounted for in such a study since they’re still alive! Consider this: Say being a sheriff is an especially dangerous occupation and the average age of death for sheriffs was 35 years based on death certificates. Well, what about the sheriffs, some younger, some older, who are still alive and who may live decades longer than this average? Their good fortune to date has not been calculated into the study. Yet these alive-and-kicking sheriffs, and all other still-living persons representing all possible professions, are a major component of the population we hope to generalize to.

    The bias of the deceased group is also fairly apparent when we consider that it is composed of (a) aged people of various professions who died more or less naturally, (b) people whose professions exposed them to a level of danger that led to their untimely deaths, and (c) others who died early due to extraneous causes (like the students). The professional breakdown among the oldest group might be a reasonable ballpark estimate for the mostly safe occupations from the larger population. But, as I already explained, the second group is an incomplete sample from a more dangerous occupation. Also notice that there will be systematic under-representation of younger people from dangerous and safer professions among the death certificates, compared to the prevalence of these groups within the larger population.
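    For anyone who wants to see the bias in action, here’s a toy simulation with entirely made-up numbers. Everyone’s longevity is drawn from the same distribution, but the death certificate records ‘student’ only when death happens to fall within the student years:

```python
import random

random.seed(42)

# Hypothetical population: longevity has nothing to do with profession.
deaths = []
for _ in range(100_000):
    age_at_death = random.gauss(60, 20)  # made-up longevity distribution
    # The certificate records whatever the person was doing when they died;
    # people are 'students' only from roughly age 18 to 25.
    profession = "student" if 18 <= age_at_death <= 25 else "other"
    deaths.append((profession, age_at_death))

for prof in ("student", "other"):
    ages = [a for p, a in deaths if p == prof]
    print(prof, round(sum(ages) / len(ages), 1))
# 'student' averages about 22, 'other' about 61 -- yet by construction
# being a student is no more dangerous than any other profession.
```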

  4. Thanks Ray for taking the time to explain.
    It’s all about looking at what question is being answered, and I was still too focused in on the data…
    I’m just a learner at stats, and I look after many of my library’s stat spreadsheets.

    I wonder if I can suggest a scenario that is puzzling me at the moment?

    My library has one very large branch and 3 tiny ones. We are looking at average desk shifts per staff member.
    Can we compare desk shifts across such different branches?
    We tend to look at average hours of desk work as if desk work at a tiny site with a small number of interactions is the same as at the larger site with lots of patron interactions, more complex queries, handling queues, troubleshooting, etc.

    Any thoughts?

    1. Catherine – Sorry for my delayed answer. Your question is more complex than you might think, as it applies to measurement in all kinds of situations, in both private industry and public and not-for-profit organizations. It sounds like you’re interested in tracking staff hours spent at reference desks, is that right? That is different from workload/customer demand at any given desk, though the two will be related.

      In either case—examining either (a) how staff time at small branches is distributed compared to time at large branches or (b) how workload/customer demand is distributed at small versus large branches—there is at least one key issue you need to address: your comparisons may well be apples and oranges.

      Thinking of how benchmark comparisons are done, it’s always important that the entities compared be as similar as possible. It’s fine to compare workloads, including the variability of these, at two similarly-sized branches. But comparing one large branch with smaller ones is riskier, mainly due to the scale of the numbers involved. This topic is rather big to cover here, but I’ll offer one example. Say you want to compare the percentage of all customer requests that occur during the morning versus the afternoon. At the small locations, a few more or fewer customers can account for a big percentage change, whereas at the main branch they will not. Perhaps you get the idea. So you have to tread carefully so you don’t make inappropriate comparisons.
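      Here’s a toy illustration, with invented counts, of that scale problem:

```python
def morning_share(morning, afternoon):
    """Percent of a branch's requests that occur in the morning."""
    return 100 * morning / (morning + afternoon)

# Tiny branch: 6 morning and 6 afternoon requests; 3 extra customers
# swing the morning share by 10 percentage points.
print(morning_share(6, 6), morning_share(9, 6))          # 50.0 -> 60.0
# Large branch: the same 3 extra customers barely register.
print(morning_share(300, 300), morning_share(303, 300))  # 50.0 -> ~50.2
```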
