About 45 years ago, I spent a whopping $1.95 on a little book titled "How to Lie with Statistics."
Besides the catchy title, its bright orange cover has a comic
character sweeping numbers under a rug. Darrell
Huff, a magazine editor and a freelance writer, wrote the book in 1954. It went on to become the most popular
statistics book in the world for more than half a century. A translated
version was published in China around 2002.
The entire book, about 140 pages with 80 pictures, takes only a few
hours of leisurely reading, but it was a major reason why I pursued an
education and a professional career in statistics.
The corners of the book are now worn; the pages have turned yellow. One can identify some of the social changes of
the last 60 years from the book. For
example, $25,000 is no longer an enviable annual salary; few of today’s younger
generation may know what a “telegram” was; “gay” has a very different meaning
now; and “African Americans” has replaced “Negroes” in daily usage. Indicative of a bygone era, the image of
a cigar, a cigarette, or a pipe appears in at least one out of every five pictures
in the book; even babies are puffing away in high chairs. The word “computer” does not show up once among
its 26,000 words.
Huff’s words were simple, but sharp and direct.
He provided example after example
showing that the most respected magazines and newspapers of his time lied with
statistics, just as the dreadful “advertising man” and politician did.
According to Huff, most humans have “a bias to favor, a point to prove, and an axe to
grind.” They tend to overstate or
understate the truth in responding to surveys; those who complete surveys are
systematically different from those who do not respond; and built-in partiality
occurs in the wording of a questionnaire, the appearance of an interviewer, or the interpretation
of the results.
There were no desktop computers or mobile devices;
statistical charts and infographics were drawn by hand; data collection,
especially complete counts like a census, was difficult and costly. Huff conjectured, and the statistics
profession has concurred, that the only reliable small sample is one that
is random and representative, with all sources of bias removed.
Calling anyone a liar was harsh then, and it still is now. The dictionary definition of a lie is a false statement made with deliberate intent to deceive. Huff considered lying to include chicanery, distortion, manipulation, omission, and trickery; ignorance and incompetence were only excuses for not recognizing them as lies. One may also lie by selectively using a mean, a median, or a mode to mislead readers although all of them are correct as an average.
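The point about averages is easy to see with a few numbers. The sketch below uses made-up salaries (they are not taken from Huff's book) and Python's standard `statistics` module to show how three equally "correct" averages can tell three different stories about the same data.

```python
from statistics import mean, median, mode

# Hypothetical annual salaries at a small firm; invented for illustration only.
salaries = [20_000, 20_000, 20_000, 25_000, 30_000, 35_000, 200_000]

# Each of these is a legitimate "average" of the same seven numbers.
averages = {
    "mean": mean(salaries),      # pulled upward by the single large salary
    "median": median(salaries),  # the middle value of the sorted list
    "mode": mode(salaries),      # the most common value
}

for name, value in averages.items():
    print(f"{name}: ${value:,.0f}")
```

An employer recruiting staff might quote the mean, while a union negotiator might quote the mode; neither figure is false, but each is chosen to favor a point of view.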
No
matter how broadly or narrowly lies may be defined, it cannot be denied that people
do lie with statistics every day. To some
media’s credit, there are now fact-checkers who regularly examine stories or
statements, most of them based on numbers, and evaluate their degree of truthfulness.
In the era of Big Data, lies occur at higher velocity, in bigger volume, and with
greater variety.
Moore’s law is not a legal, physical, or natural law, but a loosely fitted regression
equation on a logarithmic scale. Each of
us has probably won the Nigerian lottery or one of its variations via email at least a
few times. While measures of gross
domestic product or pollution are becoming more accurate because of Big Data,
nations liberally quote either the aggregate or the per capita average, depending on which
favors their point of view.
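As a rough illustration of the aggregate versus per capita choice, the sketch below compares two invented countries. The names and figures are hypothetical and do not describe any real nation; the helper `per_capita` is likewise just an illustrative name.

```python
# Hypothetical figures for two made-up countries (not real data).
countries = {
    "Country A": {"gdp_billion": 10_000, "population_million": 1_000},
    "Country B": {"gdp_billion": 2_000, "population_million": 50},
}

def per_capita(country):
    """Return GDP per person in dollars."""
    gdp_dollars = country["gdp_billion"] * 1e9
    people = country["population_million"] * 1e6
    return gdp_dollars / people

# The "bigger economy" depends entirely on which measure is quoted.
largest_total = max(countries, key=lambda name: countries[name]["gdp_billion"])
largest_per_capita = max(countries, key=lambda name: per_capita(countries[name]))

print("Largest aggregate GDP:", largest_total)
print("Largest GDP per capita:", largest_per_capita)
```

With these assumed numbers, Country A leads on the aggregate and Country B leads per person, so each could truthfully claim to be ahead.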
Heavy mining of satellite, radar, audio, sensor, and other Big Data may one
day solve the tragic mystery of Malaysia Airlines Flight MH370, but the many pure
speculations, conspiracy theories, accusations of wrongdoing, and irresponsible
lies quoting these data have mercilessly added anguish and misery to the
families of the passengers and the crew.
No one seems to be tracking the velocity, volume and variety of the false
positives that have been generated for this event, or other data mining efforts
with Big Data.
The responsibility is of course not on the data; it is on the people. There is the old saying that “figures don’t
lie, but liars figure.” Big Data, in terms
of advancing technology and the availability of massive amounts of randomly and
non-randomly collected electronic data, will undoubtedly expand the study of
statistics and bring our understanding and governance to new heights.
Huff
observed that “without writers who use
the words with honesty and understanding and readers who know what they mean,
the result can only be semantic nonsense.” Today many statisticians still use terms
like “Type I error” and “Type II error” to promote statistical understanding,
yet these concepts and their underlying pitfalls are seldom mentioned in Big Data
discussions.
At the end of his book, Huff suggested
that one can try to recognize sound and usable data in the wilderness of fraud
by asking five questions: Who says so? How does he know? What’s missing? Did
somebody change the subject? Does it make sense? They are not perfect, but they are worth
asking. On the other hand, healthy
skepticism should not become overzealous in discrediting truly sound and
innovative findings.
Faced with the self-raised question of
why he wrote the book, especially with a title and content that provide ideas
for using statistics to deceive and swindle, Huff responded that “[t]he crooks
already know these tricks; honest men must learn them in defense.”
How I wish there were a book about how to
lie with Big Data now! In the meantime,
Huff’s book remains as enlightening as it was 45 years ago, although its price
has gone up to $5.98 and is almost matched by its shipping cost.
Jeremy S. Wu, Ph.D.