A combination of traditional censuses and the introduction of random
surveys served to measure and draw inferences about populations and economies in the 20th
century. Both statistical approaches
have been important in supporting decision- and policy-making worldwide, as
well as informing the public. The 21st
century has begun with the massive conversion to digital data and the explosive
growth of Big Data around the globe, which in turn stimulates an insatiable
demand for ever more timely and comprehensive responses to information
needs. Since conventional approaches to
censuses and surveys are static and cross-sectional, they will not be able to
meet these expanding dynamic requirements without fundamental changes. In the 21st century the defining
characteristics of statistical systems and methods will be the sophisticated application
of massive longitudinal data, integration of multiple data sources, and rapid and
simple delivery of results, while still strictly protecting confidentiality and
data security and assuring accuracy and reliability. Leaders of the statistical agencies from several
nations including the United States have recently reiterated these needs and
trends. Government agencies that can
successfully overcome these issues will help their nations enjoy unique
advantages in global competition; otherwise, they will face certain obsolescence. As a rapidly growing economic power whose
statistics are receiving more attention and having greater impact in the world,
China faces many of the same challenges.
This article identifies some of the success stories emerging in the U.S.
and other nations and discusses the needed changes in statistical paradigms to
meet the challenges of dynamic, integrated data systems for the 21st
century.
The 20th Century
Statistical Systems
Taking a census, i.e. collecting data from every entity in a target
population, has been the traditional statistical method to measure the profile and
characteristics of a population for centuries.
China reported its first population census more than 2,200 years ago
[1]. By the Western Han Dynasty around 2
A.D., available records [2,3,4,5] placed China’s population at almost 58
million in over 12 million households. The
People’s Republic of China enacted its first laws on the governance of statistics
in 1983 [6]. China has taken six population
censuses since 1949, and every ten years since 1990 [7]. Continuing a practice required by its
constitution for more than two centuries, the United States (U.S.) conducted its 23rd
and most recent decennial national population census in 2010 [8,9].
Myriad other topical censuses, such as those on the economy, industry,
and agriculture, are also commonly conducted in the U.S., China, and other
nations. For example, the U.S. conducts
an economic census of business activities every five years. The next economic census is scheduled to start
in 2012 [10]. The 2007 economic census
covered 24 million businesses in the non-farm private economy, accounting for
about 96% of the U.S. Gross Domestic Product [11]. China conducted its last economic census in
2008 [12]. Although each census may have
different legal origins or motivations, the ultimate purpose is similar – to
provide relevant, current, and reliable information for research, analysis, and
ultimately decision- and policy-making.
While the census has demonstrated its importance for many centuries, it
has several well-known practical shortcomings. Most significantly,
human activities are continuous and dynamic over time, but a census can
only provide a comprehensive snapshot on a designated census day or a defined
period of time. Census results typically
become outdated as soon as they are released.
Dynamic human behavior and social, economic and political phenomena
cannot be fully captured by a census taken at a single point in time. The operation of a national census is typically
so complex that multiple years are needed to design and collect data. More time is then spent to process, analyze,
and report the results. The cost of a national
census has become so prohibitively high that it is usually supplemented by smaller
random surveys to provide more frequent results.
After more than a decade of design, development, and testing, the
American Community Survey (ACS) [13] began to implement “continuous
measurement” on the characteristics of U.S. population and housing in
2005. About 3 million addresses are
sampled per year (or 250,000 addresses per month) for a 5-year rolling
cycle. The ACS produces estimates by
aggregating data collected in the monthly surveys over a period of time,
summarized annually on a calendar-year basis. For local geographies with small populations, the
ACS estimates may take up to five years to aggregate and become reportable [14].
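The pooling of monthly samples into period estimates can be illustrated with a toy sketch. The data, field names, and formulas below are invented for illustration; the actual ACS estimation methodology involves survey weights, population controls, and variance estimation and is far more involved.

```python
from statistics import mean

# Hypothetical monthly survey batches: each is a list of sampled household
# incomes. In the real ACS, each month covers roughly 250,000 addresses.
monthly_samples = {
    (year, month): [40000 + 1000 * ((year + month) % 5)] * 10  # toy data
    for year in range(2005, 2010)
    for month in range(1, 13)
}

def period_estimate(samples, start_year, n_years):
    """Pool every monthly batch in an n-year window into one period estimate."""
    pooled = []
    for (year, month), batch in samples.items():
        if start_year <= year < start_year + n_years:
            pooled.extend(batch)
    return mean(pooled)

one_year = period_estimate(monthly_samples, 2009, 1)   # 12 months pooled
five_year = period_estimate(monthly_samples, 2005, 5)  # 60 months pooled
```

The design trade-off is visible even in this sketch: the 1-year estimate is more current, while the 5-year estimate pools many more observations, which is why small geographies must wait for the longer window to become reportable.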
China reported its total population and structural changes for 2011
based on the National Sample Survey on Population Changes, which is
described as a stratified, multi-stage, cluster, probability proportional to size
sampling method. Nearly 1.5 million
persons in 31 provinces, 4,800 villages, 4,420 townships, and 2,133 counties
were reportedly interviewed for the most recent update [15].
Random sampling is a relatively new concept, introduced by the director
of the Norwegian Statistics Bureau to the International Statistical Institute
(ISI) in 1895 [16]. The international statistical
community spent more than 30 years debating its merit before deciding that random
sampling is an acceptable and sound scientific practice. During this period, theories and practices of
today’s mathematical statistics developed and grew to support the sampling
approach.
The first Department of Statistics in an arts and sciences college in
the U.S. was established at the George Washington University in 1935 [17]. Academia would become the primary training
ground for future statisticians. According
to the U.S. Census Bureau, statistical sampling methods were first used in its
1937 test survey of unemployment, partly in response to the need for more
timely information about the scope of unemployment during the Great Depression
[18]. Governments would become the
primary employer of future statisticians.
Supported by new theories and tested by applications in many fields, combined
with the introduction of commercial computers in the 1950s and subsequent desktop
computing, random surveys soon became the standard statistical practice to
collect data and perform statistical analyses for making informed decisions. The
foundation for today’s statistical systems was built primarily from computing
technologies developed in the 1970s, before the commercialization of the
Internet ushered in the advanced information age in the 1990s.
By the end of the 20th century, statistical systems, including
census and survey data, were not only core governmental operations but also the
analytical foundation for market research, political predictions, agriculture
and economic development planning, environmental management, public health,
transportation planning, physical sciences, and other human and societal activities. However, data must be collected according to
statistical designs, including the application of probability principles,
before they can be used for making statistical inference. Large-scale statistical analyses were typically
conducted by statisticians and subject-matter experts in either government or
academia.
21st
Century Information Needs and Trends
The first decade of the 21st
century was marked by the rapid conversion of data from analog to digital, as
well as its rapid acceptance and use by a fast-growing base of Internet
users, most of whom are not statistical experts in academia or government.
Visualized: A Zettabyte [21]
The capacity to create and store digital
data reportedly exceeded one zettabyte (1 ZB or 10^21 bytes) for the
first time in 2010 [22,23], compared to about 0.29 ZB in 2007 and 0.00002 ZB in
1986 [19,20]. An industry executive declared
that “(e)very two days now we (human beings) create as much information as we
did from the dawn of civilization up until 2003” [24]. To illustrate the relative magnitude, the
entire human genome [25] containing 3 billion chemical bases along the
chromosomes of an individual can be captured in about 3 gigabytes (3 GB or 0.000000000003
ZB) of computer storage, relatively modest according to today’s standards. In contrast, the Alpha Magnetic Spectrometer
[26] records cosmic ray data at about 1GB per second.
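The relative magnitudes cited above can be checked with simple unit arithmetic, sketched here in Python:

```python
ZB = 10**21  # bytes in a zettabyte
GB = 10**9   # bytes in a gigabyte

# One human genome (~3 billion bases) stored in about 3 GB:
genome_in_zb = 3 * GB / ZB          # 3e-12 ZB, i.e., 0.000000000003 ZB

# The Alpha Magnetic Spectrometer at ~1 GB per second, run for a full year:
seconds_per_year = 365 * 24 * 3600
ams_year_in_zb = GB * seconds_per_year / ZB   # roughly 3.15e-05 ZB per year
```

A single instrument at 1 GB per second thus produces on the order of ten million genome-equivalents of raw data per year, yet still only a small sliver of a zettabyte.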
In practical terms, it means that paper
records are becoming obsolete, the private sector is also generating large
amounts of data, and billions of data consumers are not necessarily
specialists.
Complete sets of data are easily
captured into electronic files for direct machine processing and computation
without the need or consideration for sampling.
The speed of this enormous change was concurrently matched by the spread
of electronic data beyond political and geographical boundaries. Access to and use of information technology
is now pervasive if not commonplace in developed nations as well as in less
developed nations. No matter where a
computer is located in the world, it can be accessed as long as it is connected
to the Internet.
Big Data is a new and loosely defined
term for large electronic datasets that may or may not be collected according
to the structure and probability principles specified in the traditional statistical
systems. Administrative records, social
media, barcode and radio frequency scanners, transportation sensors, energy and
environmental monitors, online transactions, streaming videos, and satellite
images have all contributed to the explosive growth of Big Data. Most of these Big Data are not structured for
conventional statistical analyses and inferences, nor are they simple or easy
to use initially with current software and statistical systems. However, some contain important information
that has not been available before for decision- and policy-making, especially
when they are appropriately integrated into government data sources.
The private sector has led the way in
generating Big Data, integrating them with government statistics, and
developing data mining techniques and methods to identify potential consumers,
expand markets, test new products, and extract information for market and
consumer research. In some cases, they
may even challenge traditional government functions. For example, certain search terms [27] in the
social media may be good indicators of flu activity and can perform as well as the
indicators produced by the public health agencies, if not actually better in
terms of reduced lag time.
Despite its diminished share in the
ocean of available data, government statistics remain uniquely important in
support of an increasingly global economy and expanding social needs of each
nation. However, in an era when search
engines produce millions of results in seconds and international stock market data
are reported in almost real-time around the clock, taking years and even months
to collect, process, and release static results for limited coverage of
geography, industry, or demographics is rapidly losing relevance.
Most nations, even developed nations,
are facing severe budgetary constraints.
The high costs and limited return with the current approach preclude the
introduction of new censuses and surveys or the feasibility of any major
expansion of the current census and sampling approach. Declining response rates worldwide
compound the problem. For example, despite
intense planning and efforts, the participation rate for the 2010 decennial
census in the U.S. barely matched the 74 percent achieved in 2000 [28]. Follow-up personal interviews would increase
the average census cost to $56 per household [29], about 100 times the original
mailing cost.
In fact, the U.S. House of
Representatives voted in May 2012 to terminate the American Community Survey,
citing both confidentiality and budgetary concerns. It is also uncertain now whether the 2012
economic census will be conducted in the U.S. as originally planned.
The challenge to the national
statistical agencies is real and daunting: the 20th century
statistical systems can no longer adequately meet the needs of the 21st
century. Consumers of government
statistics are rapidly increasing in number and breadth. They require more comprehensive, dynamic, and
timely data that can be accessed and understood easily, but the resources and
time of development required by the existing methods are simply not available
or affordable. Governments are still
expected to provide statistics that are accurate and reliable, while strictly
protecting the confidentiality of the responding entities.
Failing to meet these requirements, the
Australian Bureau of Statistics is not sure that it “will remain at the heart
of official information for societies” [30]. The arrival of the Big Data era, along with the
growing user requirements, is inevitable, and yet many governments and their
statistical agencies are still unprepared in making the best use of Big Data.
Fundamental change in the historical census and survey paradigm will be
necessary to meet these challenges of the 21st century. Small evolutionary steps to tinker
at the edges of the current statistical systems built on knowledge and technologies
grounded in the 1970s will simply not be adequate for the Big Data Revolution.
Characteristics
of the 21st Century Statistical Systems
The defining characteristics of the 21st century statistical
systems will be the sophisticated application of massive longitudinal data,
integration of multiple data sources, and rapid and simple delivery of results,
while still strictly protecting confidentiality and data security and assuring
accuracy and reliability.
Longitudinal data refer to repeated observations of the same entity
(such as a worker, a student, a household, a business, a school, or a hospital)
over time. They provide unique measures
of a baseline and change at the individual or business level. These measures are not tracked and measured
in the conventional cross-sectional studies, which collect data from multiple
subjects at the same point in time.
In particular, longitudinal administrative records are potential data
sources for developing a comprehensive statistical system. Statistics Canada defines administrative
records simply as “data collected for the purpose of carrying out various
non-statistical programs” [31]. Examples
include birth and death certificates, customs
declarations, marriage and driver’s licenses, individual and business taxes,
unemployment insurance payments, social security, and medical prescriptions. There are a number of examples of massive
amounts of longitudinal administrative records.
- A new business must complete forms to register before it can start operation. Reports are produced to pay salaries and taxes on a regular basis. Additional paperwork must be completed if loans are made or if there are mergers and acquisitions. Corporations must file applications before their shares can be publicly traded.
- A student must fill out forms to enter a school. He or she must register to enroll in classes. Individual grades and test scores are recorded. A transcript is needed to move from one school to another. A diploma or degree is issued when a student graduates.
- Similarly, there are records for each person’s visit to a doctor’s office or admittance to a hospital, vital health signs that are measured during the visit, symptoms of illness, and amount and type of medical prescriptions.
Under proper design and automation, the cost of linking electronic data
records is only a fraction of the cost of labor-intensive survey or census data
collection. There is also no additional
burden on the respondents because the administrative records already
exist. Once established, the need to collect
individual demographic data such as gender, date of birth, race, and ethnicity
will be greatly reduced because such attributes either do not change or change in
predictable ways.
The potential of integrating administrative records into statistical
systems and substituting for a population census were discussed and debated
vibrantly during the last two decades of the 20th century [e.g., 32,33,34,35,36]. Pioneered by Denmark in 1981, at least 20 out
of 27 European Union nations are now using population registers or a
combination of population registers and the traditional census to count their
populations [37].
Although longitudinal studies have been used quite extensively in clinical trials for many years, their integration and applications in other areas have been sparse and limited due largely to complex design, high cost of processing and data storage, difficulty in understanding and accessing the data, and concerns about protecting confidentiality.
In a recent blog by the Director of the U.S. Census Bureau on a summit
meeting between the leaders of the government statistical agencies from
Australia, Canada, New Zealand, United Kingdom, and the U.S. [38], consensus
and shared vision were reported about the 21st century official
statistical systems. Among the elements of
this future vision is:
“Blending together
multiple available data sources (administrative and other records) with
traditional surveys and censuses (using paper, internet, telephone,
face-to-face interviewing) to create high quality, timely statistics that tell
a coherent story of economic, social and environmental progress must become a
major focus of central government statistical agencies.”
Government statistical agencies must
continue to create and maintain frames for conducting censuses or surveys, but
by making additional, optimal use of available data sources. Such frames have been static and minimal in
content in the past. In the 21st
century, these frames must become dynamic in structure and rich in content,
capable of serving as the first source for producing comprehensive, top-quality, and timely statistics
regularly and on demand, integrating new relevant data elements and sources as
they are identified and introduced. These
dynamic national frames will include both statistical and geographical data for
companion mapping and reporting, and serve in a secondary role of a traditional
frame for census or survey where needed.
“Telling a coherent story” is part of the needed change in paradigm for
statistical agencies in the 21st century. The role of descriptive statistics has long
been treated by statistical professionals as secondary or supplemental to
statistical inference. Modern data
visualization methods and applications to extract information dynamically from complex
data are in every way a valuable, statistical practice in the Big Data era. When government and academic experts are no
longer the only or even dominant data suppliers and data analysts, ease of
understanding, access, and use must also be an integral part of rapid delivery
of results.
The assembly and maintenance of a comprehensive, dynamic statistical
system requires massive amounts of sensitive personal and business data. However, the end results must be in the form
of statistical summaries that are devoid of any possibility of
identification or re-identification of the original entities. Individuals and enterprises should rightly be
concerned and informed about protection of their confidentiality against any misuse
and abuse of their data. Integrity and
security of the infrastructure data, as well as the output statistics, must also
be strictly safeguarded against intended or malicious tampering and alteration.
Emerging Success Stories
Several countries have initiated public integrated longitudinal data
programs on employment, education, and public health. These initiatives are at various stages of
development, and provide encouraging news about the feasibility of creating and
maintaining comprehensive dynamic statistical systems in the Age of Big Data,
although there are still many challenges.
An international symposium was held in 1998, featuring research using
integrated employee-employer data from more than 20 nations [36,39]. The U.S. Census Bureau started the
Longitudinal Employer-Household Dynamics program later that year to create new,
innovative statistical products by linking existing employer-employee data [40].
Today the U.S. federal government and each of the 54 state, city, and
territorial governments have agreed to secure the continuing supply of
unemployment insurance wage records for workers and employers from the states
on a quarterly basis. The U.S. Census
Bureau updates and maintains a longitudinal national frame of jobs going as far
back as 1990. Each job connects a worker
with an employer, and a worker can have multiple jobs. This data infrastructure was designed to track
and refresh the employment status and pay for each of the over 140 million
workers and more than 10 million employers (including the self-employed) every
3 months, while still strictly protecting the confidentiality of each entity by
legal, policy, physical and methodological means.
The longitudinal data infrastructure has stimulated the development of
creative, practical online applications using the new data, such as time series
indicators to describe the underlying dynamics of the U.S. workforce at
unprecedented levels of demographic and geographic detail [41]. In addition, an innovative mapping and
reporting application allows a user to select any geographical area online to
produce worker profiles and potential commuting reports [42], as well as almost
real-time assessments of the potential impacts of hurricanes and other natural disasters
in emergency situations [43]. The
application was presented as an innovative statistical product in the United
Nations Statistical Commission [44] and received the gold medal from the U.S.
Department of Commerce, the highest form of recognition for scientific
accomplishments in the department.
The Data Quality Campaign (DQC) [45] was launched in 2005 to empower
stakeholders of the U.S. education system, including students, parents,
teachers, and policy-makers, with “high quality data from the early childhood,
K-12, postsecondary, and workforce systems to make decisions that ensure every
student graduates high school prepared for success in college and the
workplace.” To achieve this vision, “DQC
supports state policymakers and other key leaders to promote the development
and effective use of statewide longitudinal data systems.”
The U.S. Departments of Education and Labor concurrently issued competitive
grants to states for the construction and integration of these comprehensive statewide
longitudinal data systems. In the words
of DQC, “we can no longer afford to not use data in education” to make informed
decisions. In particular, DQC identified
“10 Essential Elements of Statewide Longitudinal Data Systems” and “10 State
Actions to Support Effective Data Use” as roadmaps for state policymakers. Status and progress of each state have been
tracked by annual surveys since 2005.
The Health Information Technology for Economic and Clinical Health Act
of 2009 [46] established the goal of widespread adoption and meaningful use of
electronic health records by 2014 in the U.S. Belgium [47] reported on the Belgian
Longitudinal Health Information System initiative in 2011 by broadly defining
health-related data as “all personal data that concerns past, current or future
states of the physical or mental health of the person.” The research focused on the longitudinal
approach of health and made comparison to international initiatives, including
those in Canada, Denmark, and United Kingdom.
China has also initiated major public health reforms [48] since 2009,
including a national longitudinal system of electronic health records [49] covering
its 1.3 billion citizens as part of its institutional infrastructure. Key policies have been established, and the system is being populated.
Major Challenges and China
The U.S. statistical system is highly decentralized. Although the 2012 share of the budget
resources supporting federal statistics is only 0.02 percent of the Gross
Domestic Product in the U.S., it is spread across 13 principal statistical
agencies and more than 85 other agencies that carry out statistical activities
along with their non-statistical program missions [50]. A sizable portion of the U.S. effort has
been spent on overcoming the barriers inherent in its decentralized structure –
inadequate data sharing, competing data quality standards, unnecessary
duplication, multiple administrative costs, and resolving difficulty of data
access.
For example, the U.S. Census Bureau and the Bureau of Labor Statistics
maintain two separate Business Registers; each register is supposed to
represent the universe of U.S. businesses.
These registers are the sampling frames from which surveys and censuses
are drawn, contributing to important information such as the national economic
indicators for the U.S. However, due to
their independent sources and dynamic nature, the two registers have
substantial differences in the number of firms and their respective payroll and
employment [51]. Although progress has
been made in the last decade, a single source Business Register for the U.S.
has yet to emerge.
The White House announced a “Big Data Research and Development
Initiative” [52] in March 2012, providing $200 million in new research and
development investments to improve the ability to extract knowledge and
insights from large and complex collections of digital data. Efforts thus continue toward
convergence on Big Data inside the statistical agencies in the U.S. Top governmental funding, commitment, and
leadership, especially in the transparency and openness towards data [53], will
also be needed in other nations including China.
A recent Baidu search showed that awareness of the Big Data issues in
China appears to be sporadic but increasing rapidly in the last 6 months. The search results included a translation of
a February 11 New York Times article on “The Age of Big Data” [54], an
interview with the author who published the first known Chinese-language book [55,56]
on the topic of Big Data on July 14, and a media report on how Big Data may threaten
individual privacy [57], also on July 14.
A seminar hosted by Tsinghua University on July 1, 2012 was one of very
few known activities about how Chinese statistical bureaus, universities and
the private sector are actually dealing with Big Data and its impact on the
statistical systems. A notable exception
is the research being conducted by the Alibaba Group, which has
millions of small companies using its site and billions of dollars of e-Commerce
transactions in China every day.
The need for top quality statistics is no less in China than in other
nations. Many of the key targets for
China’s 12th five-year plan [58] are defined in quantitative terms. As China is transforming from an
export-oriented nation to a consumer-oriented nation, the status and progress of
each goal over time will be measured and evaluated by statistics and indicators
that must be credible, reliable, and timely.
As China’s economic growth has slowed recently, in-depth data are needed to
understand the latest trends and patterns, as well as to assist in developing potential
mid-course modifications and corrections.
These statistics and indicators have great influence on the global
economy as China ascends into world power status. As the saying goes, when China sneezes, the
rest of the world catches cold.
A recent report from the Chinese academia [59] provided an overview of a
longitudinal micro-level database focused on corporate behavior and
performance in China. The database is
known variously as the “Chinese Industrial Enterprises Database,” “China Annual
Survey of Industrial Firms,” or “China Annual Survey of Manufacturing Firms.”
Based on regular and annual reports submitted by sample enterprises to
the local statistical bureaus, the National Bureau of Statistics of China assembles
and maintains this database covering all state-owned and large-scale
non-government firms beginning in 1998.
Its largest industrial component is the manufacturing sector. This economic database is the only supplement
to the Chinese economic census, representing about 90% of the sales volume of
all industrial enterprises in China in 2004.
Small and medium-size businesses, as well as e-Commerce companies, are
not included.
The article identified nine areas for which important information can
be extracted from the database and described its increasing domestic and
international interests and use for analysis.
However, the database “suffers from data matching problems as well as
measurement errors, unrealistic outliers and definition ambiguities etc., all
of which practically lead to research results thereupon that are at best
questionable.”
The article appeals for effective leadership and management to overcome
the fundamental problems that seriously undermine a valuable longitudinal statistical
system. Herein lie some of the
major challenges for the development of the 21st century statistical
system that are shared across nations.
Not all Big Data are structured or suitable for integration into statistical
systems for intended statistical use. Optimal
extraction of information from data and total quality management are core values
and functions of the statistics profession.
Awareness of the limitations of data, as well as professional
considerations against false discovery, bias, and confounding, is part of the
value that can be added by statisticians [60].
With knowledge accumulated from the past centuries, the statistics
profession is well positioned to address the many challenges of Big Data, prompting
the prophecy that “the sexy jobs for the next 10 years will be statisticians”
[61].
Actual experience and empirical evidence have suggested at least the
following list of major potential contributions by statisticians:
- Record Linkage. Design of system and application of exact and probability matching techniques to improve record linkage from multiple data sources and to minimize mismatches for small populations.
- Imputation. Development and application of imputation techniques to reliably replace missing values created by merging data from different data sources, but not to create unsupported, artificial information.
- Data Quality Assurance. Establishment of sound methods and rules to continuously measure and detect the presence of outlying or influential observations and to apply best, appropriate resolutions.
- Evolving Standards. Standardization of terms and definitions to provide consistent understanding, but be sufficiently flexible to adjust for new concepts such as “green industry.”
- Statistical modeling. Mathematical abstraction and statistical application that can range from imputing missing values, profiling markets and customers, assessing risks, predicting future occurrences, to creating artificial intelligence.
- Data Visualization and Innovative Applications. Innovative and timely dissemination and presentation to develop coherent stories, market new concepts, and improve statistical education.
- Confidentiality Protection. Development and application of statistical methods and rules such as noise infusion and synthetic data to protect confidentiality of individual entities and to quantify the level of protection being applied.
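The first item above, record linkage, can be sketched in a few lines. This is a minimal illustration with invented records and identifiers: exact matching on a shared key where one exists, with a crude similarity score on normalized names as the probabilistic fallback. Production linkage systems (e.g., those based on the Fellegi-Sunter framework) use many more fields, calibrated match weights, and careful blocking.

```python
from difflib import SequenceMatcher

# Two hypothetical administrative files keyed by imperfect identifiers.
wage_records = [
    {"id": "123-45-6789", "name": "Zhang Wei", "quarterly_pay": 12000},
    {"id": None,          "name": "Li  Na",    "quarterly_pay": 15000},
]
enrollment_records = [
    {"id": "123-45-6789", "name": "Zhang Wei", "school": "A"},
    {"id": "987-65-4321", "name": "Li Na",     "school": "B"},
]

def name_similarity(a, b):
    """Crude string similarity on whitespace-normalized names (0.0 to 1.0)."""
    norm = lambda s: " ".join(s.lower().split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

def link(file_a, file_b, threshold=0.9):
    """Exact match on the identifier first; fall back to a name-based match."""
    links = []
    for rec_a in file_a:
        for rec_b in file_b:
            if rec_a["id"] and rec_a["id"] == rec_b["id"]:
                links.append((rec_a, rec_b, "exact"))
            elif name_similarity(rec_a["name"], rec_b["name"]) >= threshold:
                links.append((rec_a, rec_b, "probabilistic"))
    return links

matches = link(wage_records, enrollment_records)
```

Here the first worker links exactly on the identifier, while the second, whose identifier is missing, links only through the name comparison; tuning the threshold trades missed links against false matches, which is precisely the design problem for small populations noted above.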
Big Data is more than another technological advancement that only improves
statistical computation. It is a
revolution that challenges conventional statistical thinking and stimulates
innovative thinking and development.
Some theories in mathematical statistics that have prevailed for
the last century may need extensions. For
example, while it is well known that a 5% random sample yields estimates with better
measurable properties than a 5% non-random sample, it is unknown how a 5%
random sample compares with a 30%, 50%, or larger non-random sample, which
is common in Big Data situations. How
should the metrics be modified for story-telling instead of inference-making?
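The sampling question above can be made concrete with a small simulation. This is an illustrative sketch, not an analysis from the article: the population, the self-selection rule, and all parameters are invented for demonstration.

```python
# Illustrative simulation: estimate a population mean from a 5% simple
# random sample versus a ~50% non-random sample whose inclusion
# probability depends on the value itself (self-selection bias).
import random

random.seed(7)
N = 100_000
population = [random.gauss(50, 10) for _ in range(N)]
true_mean = sum(population) / N

# 5% simple random sample: unbiased by design.
srs = random.sample(population, int(0.05 * N))
srs_mean = sum(srs) / len(srs)

# "Big" non-random sample: larger values are more likely to be observed,
# as when data coverage correlates with the outcome being measured.
nonrandom = [x for x in population if random.random() < min(1.0, x / 100)]
nr_mean = sum(nonrandom) / len(nonrandom)

print(f"true {true_mean:.2f}  5% SRS {srs_mean:.2f}  "
      f"{len(nonrandom)/N:.0%} non-random {nr_mean:.2f}")
```

Under this toy selection rule the non-random sample covers roughly half the population yet its mean is biased upward, while the much smaller random sample stays close to the truth, which is exactly the comparison the question raises.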
As happened when the concept of random sampling was first introduced in
1895, construction of 21st century statistical systems will be
empirically guided and will proceed concurrently with theoretical development. However, the
international statistics community cannot afford to take another 30 years to welcome and
embrace the use of Big Data.
The revolutionary and innovative development of 21st century
statistical systems using Big Data will be multi-disciplinary, involving
expertise from statistics, computer science, and geography, as well as subject-matter
fields such as economics, education, energy, environment, health care, and
transportation.
It will also require academic-public-private partnerships. Subject to its willingness and the appropriateness of
sharing data, the private sector is a key supplier to the 21st century
statistical systems. The role of academia
remains critically important in conducting basic research, training future “data
scientists,” and developing supporting theories. The state of Massachusetts, the Massachusetts
Institute of Technology, and multiple private-sector companies have recently joined
forces and taken a first step in this direction in the U.S. [62].
Summary
Taking a census has been the official statistical method for the last
two thousand years. In the last century,
random sampling was introduced and random surveys became the dominant statistical
method. When sample units are selected
according to probability theory, results from a small fraction of the
population can be used to make valid inferences about the entire population
with measurable accuracy and reliability.
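What “measurable accuracy” means in practice can be shown with a toy example (invented for illustration, not data from the article): a 95% confidence interval for a population proportion computed from a 0.5% probability sample.

```python
# Toy illustration of "measurable accuracy": a 95% confidence interval
# for a population proportion from a small probability sample.
import math
import random

random.seed(1)
N = 200_000
# Toy population: about 30% of units have the characteristic of interest.
population = [1 if random.random() < 0.30 else 0 for _ in range(N)]
true_p = sum(population) / N

n = 1_000                                       # a 0.5% probability sample
sample = random.sample(population, n)
p_hat = sum(sample) / n                         # point estimate
se = math.sqrt(p_hat * (1 - p_hat) / n)         # estimated standard error
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se   # 95% confidence interval

print(f"estimate {p_hat:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

The width of the interval quantifies the reliability of the estimate in advance, which is precisely the property that distinguishes probability samples from opportunistic data collections.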
Modern information technologies that emerged in the 1990s ushered in the Big
Data era. Massive amounts of data are
becoming available; the number of potential data users with the capability to
access data has increased explosively; and the cost of storing and processing
data has decreased dramatically. The need to collect,
analyze, and disseminate data widely, promptly, and comprehensively in a global
economy is firmly established for the 21st century.
Most national statistical agencies will not be able to improve and
expand their current practices on their own.
Current theories of mathematical statistics are not sufficient to
support the empirical use of Big Data.
Twentieth-century statistical systems rooted in 1970s technologies
are no longer sufficient to meet the requirements of the 21st
century.
This paper outlines the basic challenges facing all nations and
provides emerging success stories of how Big Data from multiple sources can be
successfully integrated to construct longitudinal data systems.
Development of innovative, dynamic 21st century statistical systems with support of a new statistical foundation is both feasible and necessary. It will require government commitment and leadership, academic-public-private partnerships, concurrent multi-disciplinary research and development, and making innovative use of statistical thinking from the past centuries. The prophecy that “the sexy jobs for the next 10 years will be statisticians” can be realized. On the other hand, failure to make a paradigm change now will likely lead to irrelevance and even disappearance of national statistical agencies and the loss of competitive advantage for a nation in the global economy.
References
[1] National Bureau of Statistics of China. “History of Statistics Prior to Qin
Dynasty.” Available at http://www.stats.gov.cn/50znjn/t20020617_22676.htm
on July 16, 2012.
[2] National Bureau of Statistics of China. “History of Statistics during the Qin and Han
Dynasties.” Available at http://www.stats.gov.cn/50znjn/t20020617_22677.htm
on July 16, 2012.
[3] Wikipedia. “Census.” Available at http://en.wikipedia.org/wiki/Censuses#cite_note-9
on July 16, 2012.
[4] Hays, Jeffrey. “China –
Facts and Details: Han Dynasty (206 B.C. – A.D. 220).” Available at http://factsanddetails.com/china.php?itemid=39&catid=2&subcatid=2
on July 16, 2012.
[5] Loewe, Michael. "The Former Han dynasty." The Ch'in and Han Empires, 221 B.C.–A.D. 220. Eds. Denis
Twitchett and John K. Fairbank. Cambridge University Press, 1987. Available at http://histories.cambridge.org/extract?id=chol9780521243278_CHOL9780521243278A004
on July 16, 2012.
[6] National Bureau of Statistics of China. “Statistical Laws of The People’s Republic of
China.” Available at http://www.stats.gov.cn/tjfg/tjfl/t20090629_402568265.htm
on July 16, 2012.
[7] National Bureau of Statistics of China. “How Many Years to Conduct a Census; How Many
Censuses China Has Conducted.” Available
at http://www.stats.gov.cn/zgrkpc/dlc/zs/t20100419_402635505.htm
on July 16, 2012.
[8] U.S. Census Bureau. “What is
The Census?” Available at http://2010.census.gov/2010census/about/
on July 16, 2012.
[9] National Bureau of Statistics of China. “How Does The United States Conduct Its Population
Census?” Available at http://www.stats.gov.cn/zgrkpc/dlc/zs/t20100526_402645146.htm
on July 16, 2012.
[10] U.S. Census Bureau. “Economic
Census.” Available at http://www.census.gov/econ/census/index.html
on July 16, 2012.
[11] U.S. Census Bureau. “About
the 2007 Economic Census.” Available at http://bhs.econ.census.gov/ec07/about.html
on July 16, 2012.
[12] National Bureau of Statistics of China.
“Communiqué on Major Data of the
Second National Economic Census (No.1)”.
December 25, 2009. Available
at http://www.stats.gov.cn/english/newsandcomingevents/t20091225_402610168.htm,
on July 16, 2012.
[13] U.S. Census Bureau. “Design
and Methodology – American Community Survey.”
Chapter 2. Program History.
Available at http://www.census.gov/acs/www/Downloads/survey_methodology/acs_design_methodology_ch02.pdf
on July 16, 2012.
[14] U.S. Census Bureau. “Design
and Methodology – American Community Survey.”
Chapter 13. Preparation and Review of Data Products. Available at
http://www.census.gov/acs/www/Downloads/survey_methodology/acs_design_methodology_ch13.pdf
on July 16, 2012.
[15] National Bureau of Statistics of China. “China’s Total Population and Structural
Changes in 2011.” Available at http://www.stats.gov.cn/enGliSH/newsandcomingevents/t20120120_402780233.htm
on July 16, 2012.
[16] Wu, Jeremy S., Chinese translation by Zhang, Yaoting and Yu,
Xiang. “One Hundred Years of Sampling,”
invited paper in “Sampling Theory and Practice”. ISBN7-5037-1670-3. China Statistical Publishing Company, 1995.
[17] The George Washington University.
“The Department of Statistics”.
Available at http://departments.columbian.gwu.edu/statistics/
on July 16, 2012.
[18] U.S. Census Bureau.
“Developing Sampling Techniques”.
Available at http://www.census.gov/history/www/innovations/data_collection/developing_sampling_techniques.html
on July 16, 2012.
[19] The Washington Post. “Rise
of the Digital Information.” Available
at http://www.washingtonpost.com/wp-dyn/content/graphic/2011/02/11/GR2011021100614.html
on July 16, 2012.
[20] Hilbert, Martin and Lopez, Priscila. “The World’s Technological Capacity to Store,
Communicate, and Compute Information.”
Science, Vol. 332, No. 6025, pp. 60-65, 1 April 2011. DOI:10.1126/science.1200970. Available at http://www.sciencemag.org/content/332/6025/60.abstract
on July 16, 2012.
[21] Savov, Vlad. “Visualized: a
zettabyte” June 29, 2011. Available at http://www.engadget.com/2011/06/29/visualized-a-zettabyte/
on July 16, 2012.
[22] International Data Corporation.
“The Diverse and Exploding Digital Universe.” Sponsored by EMC Corporation, March
2008. Available at http://www.emc.com/collateral/analyst-reports/diverse-exploding-digital-universe.pdf
on July 16, 2012.
[23] Data Center Knowledge.
“‘Digital Universe’ Nears a Zettabyte.”
May 4, 2010. Available at http://www.datacenterknowledge.com/archives/2010/05/04/digital-universe-nears-a-zettabyte/
on July 16, 2012.
[24] TechCrunch. “Eric Schmidt:
Every 2 Days We Create As Much Information As We Did Up To 2003.” August 4, 2010. Available at http://techcrunch.com/2010/08/04/schmidt-data/
on July 16, 2012.
[25] Human Genome Project.
“Frequently Asked Questions.” Joint international project under the U.S.
Departments of Energy and the National Institute of Health. Available at http://ornl.gov/sci/techresources/Human_Genome/faq/faqs1.shtml
on July 16, 2012.
[26] Wikipedia. “Alpha Magnetic
Spectrometer.” Available at http://en.wikipedia.org/wiki/Alpha_Magnetic_Spectrometer
on July 16, 2012.
[27] Google. “Explore Flu Trends
around the World.” Available at http://www.google.org/flutrends/ on
July 16, 2012.
[28] U.S. Census Bureau. “2010
Census Mail Participation Rate Map.”
Available at http://2010.census.gov/2010census/take10map/
on July 16, 2012.
[29] El Nasser, Haya; and Overberg, Paul. “2010 Census Response Rate Surprisingly Close
to 2000 Rate.” USA Today. April 26, 2010. Available at http://www.usatoday.com/news/nation/census/2010-04-20-census-participation-rate_N.htm
on July 16, 2012.
[30] Pink, Brian; Borowik, Jenine; Lee, Geoff. “The Case for an International Statistical
Innovation Program – Transforming National and International Statistics
Systems.” Supporting paper, Australian
Bureau of Statistics. 10/2009. Available at http://www.abs.gov.au/websitedbs/d3310114.nsf/4a256353001af3ed4b2562bb00121564/064584f68877204fca2576c0001a0fa8/$FILE/Supporting%20Discussion%20Paper.pdf
on July 16, 2012.
[31] Statistics Canada. “Administrative Data Use.” Available at http://www.statcan.gc.ca/pub/12-539-x/steps-etapes/4147786-eng.htm
on July 16, 2012.
[32] Brackstone, G.J. “Issues in the use of administrative records for
statistical purposes.” Survey Methodology. Vol. 13. p. 29–43, 1987.
[33] Scheuren, Fritz and Petska, Tom.
“Turning Administrative Systems into Information Systems.” Available at http://www.oecd.org/dataoecd/58/48/36236959.pdf
on July 16, 2012.
[34] Office of Management and Budget.
“Seminar on Quality of Federal Data.”
Part 1 of 3. Federal Committee on
Statistical Methodology, March 1991.
Available at http://www.fcsm.gov/working-papers/wp20a.html,
on July 16, 2012.
[35] Organization for Economic Co-operation and Development. “Use of Administrative Sources for Business
Statistics Purposes.” Handbook of Good
Practices. Available at http://www.oecd.org/dataoecd/58/1/36237357.pdf
on July 16, 2012.
[36] Haltiwanger, John; Lane, Julia; Spletzer, Jim; Theeuwes, Jules;
and Troske, Ken. “Conference Report:
International Symposium on Linked Employer-Employee Data.” Monthly Labor Review, July 1998. Available at http://bls.gov/mlr/1998/07/rpt2full.pdf
on July 16, 2012.
[37] Valente, Paolo.
“Census Taking in Europe: How are Populations Counted in 2010?” Bulletin Mensuel d’Information de L’Institut National
d’Études Démographiques. Population and Societies, No. 467, May 2010. Available at http://www.unece.org/fileadmin/DAM/publications/oes/STATS_population.societies.pdf
on July 16, 2012.
[38] Groves, Robert M.
“National Statistical Offices: Independent, Identical, Simultaneous
Actions Thousands of Miles Apart.” U.S.
Census Bureau, February 2, 2012.
Available at http://blogs.census.gov/directorsblog/
on July 16, 2012.
[39] Haltiwanger, John C.; Lane, Julia I.; Spletzer, James, R.;
Theeuwes, Jules J.M.; Troske, Kenneth R.
“The Creation and Analysis of Employer-Employee Matched Data:
Contributions to Economic Analysis.”
North Holland, 1999.
[40] Wu, Jeremy S. “State
of Longitudinal Employer-Household Dynamics Program.” Unpublished manuscript, U.S. Census Bureau,
January 2006.
[41] U.S. Census Bureau. “Quarterly
Workforce Indicators Online.” Available
at http://lehd.ces.census.gov/led/datatools/qwiapp.html
on July 16, 2012.
[43] U.S. Census Bureau.
“OnTheMap for Emergency Management.”
Available at http://onthemap.ces.census.gov/em.html
on July 16, 2012.
[44] Mesenbourg Jr., Thomas.
“Innovations in Data Dissemination.”
United Nations Statistical Commission Seminar on Innovations in Official
Statistics, February 20, 2009. Available
at http://unstats.un.org/unsd/statcom/statcom_09/seminars/innovation/innovations_seminar.htm
on July 16, 2012.
[45] Data Quality Campaign.
“Using Data to Improve Student Achievement” Website. Available at http://www.dataqualitycampaign.org/
on July 16, 2012.
[46] U.S. Department of Health and Human Services. “Accelerating Electronic Health Records
Adoption and Meaningful Use.” August 5,
2010. Available at http://www.hhs.gov/news/press/2010pres/08/20100805c.html
on July 16, 2012.
[47] Ecole de
Santé publique, Vakgroep Sociaal Onderzoek – SOCO, and Institut de Recherche
Santé et Société. “Belgian Longitudinal
Health Information System: Supplement the health information system by means of
longitudinal data. Summary of the
research.” Project AGORA AG / JJ / 139.
February 2011. Available at http://www.belspo.be/belspo/organisation/publ/pub_ostc/agora/agJJ139_synth_en.pdf on July 16, 2012.
[48]
International Health Economics Association.
“China Forum.” Available at http://ihea2011.abstractsubmit.org/sessions/405/ on July 16, 2012.
[49]
sina.com.cn. “China Will Build Unified
National Citizen Health Records; Apply Standardized Management.” April 7, 2009. Available at http://news.sina.com.cn/c/2009-04-07/140517561926.shtml on July 16, 2012.
[50] Office of Management and Budget. “Statistical Programs of the United States
Government: Fiscal Year 2012.” Available
at http://www.whitehouse.gov/sites/default/files/omb/assets/information_and_regulatory_affairs/12statprog.pdf
on July 16, 2012.
[51] Foster, Lucia; Elvery, Joel; Becker, Randy; Krizan, Cornell;
Nguyen, Sang; and Talan, David. “A
Comparison of the Business Registers used by the Bureau of Labor Statistics and
the Bureau of the Census. Office of
Survey Methods Research, Bureau of Labor Statistics, 2005. Available at http://www.bls.gov/ore/pdf/st050270.pdf
on July 16, 2012.
[52] The Executive Office of the President of the United
States. “Obama Administration Unveils
‘Big Data’ Initiative: Announces $200 Million in New R&D Investments.”
March 29, 2012. Available at http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_press_release_final_2.pdf
on July 16, 2012.
[53] The White House. “Open Government Initiative.” January 21, 2009. Available at http://www.whitehouse.gov/Open/ on July 16, 2012.
[54] douban.com. “Arrival of the Big Data Era.” March 31, 2012. Available at http://www.douban.com/note/207694904/ on July 16, 2012.
[55] Tu, Zipei 涂子沛.”The Big Data Revolution 大数据:正在到来的数据革命.” Guangxi Normal University Publications. 广西师范大学出版社.
[56]
news.163.com. “An Interview with Tu
Zipei: Public Life of Dignity Needs ‘Big Data’.” July 14, 2012. Available at http://news.163.com/12/0714/02/86BEHDN600014AED.html on July 16, 2012.
[57]
news.163.com. “Disappearance of
Individual Privacy Upon the Arrival of the Big Data Era?” July 14, 2012. Available at http://news.163.com/12/0714/15/86CPGMN600014AED.html on July 16, 2012.
[58] China
Daily. “Key Targets of China’s 12th
Five-Year Plan.” Available at http://www.chinadaily.com.cn/china/2011npc/2011-03/05/content_12120283.htm on July 16, 2012.
[59] Nie,
Huihua; Jiang, Ting; and Yang, Rudai. “A
Review and Reflection on the Use and Abuse of Chinese Industrial Enterprises
Database.” To appear in World Economics,
Volume 5, 2012. Available at http://www.niehuihua.com/UploadFile/ea_201251019517.pdf on July 16, 2012.
[60]
Rodriguez, Robert. “Big Data and Better
Data.” AMSTAT News, President’s Corner,
American Statistical Association. May 31, 2012.
Available at http://magazine.amstat.org/blog/2012/05/31/prescorner/ on July 16, 2012.
[61] Varian,
Hal. “Hal Varian explains why
statisticians will be the sexiest job in the next 10 years.” September 15, 2009. YouTube.
Available at http://www.youtube.com/watch?v=pi472Mi3VLw on July 16, 2012.
[62]
Massachusetts Institute of Technology.
“MIT CSAIL & Intel Join State of Massachusetts to Tackle Big
Data.” Press release by MIT Computer
Science and Artificial Intelligence Laboratory.
May 30, 2012. Available at http://www.csail.mit.edu/node/1750 on July 16, 2012.