Abstract
A frame identifies all the known units in a population
from which a census can be conducted or a random sample can be drawn, providing
the structural foundation for the extraction of maximum, reliable information from
designed statistical studies with the support of established statistical
theories. The significance of the Big
Data era is that most data are now digitized, easily stored, and processed in
large quantity at relatively low cost. Big Data offers
unprecedented opportunities for statisticians to rethink and innovate. Among the many
possibilities offered by Big Data is the creation and maintenance of Dynamic
Frames – frames that are rich in content, capture the most up-to-date data as
soon as they become available, and produce results and reports in real time on
demand.
Traditional Population and Frame
A population is an important concept in
the study of statistics. It is commonly
understood to be an entire collection of items of interest, be it a nation’s
people or businesses, a day's production of light bulbs, or an ocean’s fish
[1,2,3].
A less well-known term is a frame: a list of the units that covers the entire
population, together with an identification system. A frame is the working definition of a
population under study. It identifies
all the known units in a population from which a census can be conducted or a
random sample can be drawn, providing the structure for statistical description
and analysis about the population [2,4,5].
Figure 1 shows a flow chart of a conventional
statistical study by census or random sample.
Quoting from [4], an ideal frame should have the following qualities:
- All units have a logical, numerical identifier
- All units can be found – their contact information, map location or other relevant information is present
- The frame is organized in a logical, systematic fashion
- The frame has additional information about the units that allow the use of more advanced sampling frames
- Every element of the population of interest is present in the frame
- Every element of the population is present only once in the frame
- No elements from outside the population of interest are present in the frame
- The data is “up-to-date”
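Several of these qualities can be checked mechanically. The following is a minimal sketch, not a standard tool: the `FrameUnit` record, the `audit_frame` function, and the sample addresses are hypothetical names and data introduced for illustration. It also assumes that an independent reference list of population identifiers is available to compare against, which is often not the case in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameUnit:
    """One entry of a hypothetical frame: an identifier plus contact details."""
    unit_id: int
    address: str


def audit_frame(frame, population_ids):
    """Compare a frame against an assumed reference set of population identifiers.

    Returns three kinds of frame defects: units listed more than once,
    population members missing from the frame (undercoverage), and frame
    entries that lie outside the population of interest (overcoverage).
    """
    population_ids = set(population_ids)
    seen = set()
    duplicates = set()
    for unit in frame:
        if unit.unit_id in seen:
            duplicates.add(unit.unit_id)
        seen.add(unit.unit_id)
    return {
        "duplicates": sorted(duplicates),
        "missing": sorted(population_ids - seen),
        "out_of_scope": sorted(seen - population_ids),
    }


# Illustrative data only.
frame = [
    FrameUnit(1, "10 Main St"),
    FrameUnit(2, "12 Main St"),
    FrameUnit(2, "12 Main Street"),  # the same unit listed twice
    FrameUnit(9, "1 Harbor Rd"),     # not in the population of interest
]
report = audit_frame(frame, population_ids={1, 2, 3})
print(report)
```

Each key of the report corresponds to one quality in the list above: every element present, every element present only once, and no elements from outside the population.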
Modeling
may be considered part of a sampling process, sometimes bypassing the need for
a frame by assuming that the model and data adequately represent the
underlying population.
Practicing statisticians understand the importance of a frame: it is the structural
foundation for the extraction of maximum, reliable information from designed
statistical studies with the support of established statistical theories. However, there are few statistical papers or
forums that discuss the best practices for creating and maintaining a frame,
primarily because it is viewed as an administrative or clerical task.
Many lament how difficult it is to obtain or maintain a good frame, or recount bitter
experiences of working with incomplete or error-prone frames. Indeed, poor quality frames may prevent a
well-planned statistical study from even taking place or create misleading or
biased results.
Inadequate
attention to the creation and maintenance of a flexible, up-to-date, and
dynamic population frame has been costly to the statistics profession and the
U.S. in terms of efficiency and innovation.
For example, according to [6], although “an accurate and complete address list is a
critical ingredient in all U.S. Census Bureau surveys and censuses,” each
program prepared its own separate list until the concept of a national frame
was advanced less than 20 years ago under the name of the Master Address File (MAF).
The MAF
is used primarily to support mail delivery of questionnaires [7], which is
increasingly an outdated mode for information collection. It is relied upon heavily for follow-up
visits to non-respondents, at a time when rising labor costs are met with tight
budget constraints. Web-based questionnaire
delivery or data submission was not allowed in the latest 2010 decennial census
in the U.S. The MAF is also not designed
to promote or support web-based applications.
The arrival of the Big Data era seems to have caught the statistics profession in a
deer-in-the-headlights moment. Even as the statistician’s job is hailed as
“the sexiest job for the next 10 years” and beyond [8],
the profession is still wondering why statistics is undervalued and left out,
while searching for the role it should play in the Big Data era [9].
Only a
few seem to recognize that statistics is “the science of learning from data”
[10], regardless of how big or small the data are, and that the moment has
arrived for the profession to join the revolution and remain relevant in the
future.
Statistics 2.0: Dynamic Frames
Big Data
is a relative concept. Tomorrow’s Big
Data will be bigger than today’s Big Data.
If it is only the size of data that statisticians would consider, the
impact of Big Data would be limited to only scaling the existing software and
methods.
The
significance of the Big Data era is that most data are now digitized, including
sound, vision, and handwriting [e.g., 11], much of which has never been
available before. They can be easily
stored and processed in large quantity at relatively low cost. Today’s consumers of statistics are far
more numerous and less interested in technical details, but they also want
comprehensive, reliable, easy-to-use information that is delivered rapidly and is readily available.
Big Data is as much a revolution in information technology as it is an advancement in
statistics because it offers unprecedented opportunities for statisticians to rethink
their systems and operations and to innovate.
For
example, mathematical statistics clearly demonstrates that a 5 percent random
sample is superior to a 5 percent non-random sample. However, how does it compare to a 50 percent
or a 95 percent non-random sample? We
have continued to caution, warn, condemn, or dismiss large, non-random samples,
but have done little to go beyond the existing framework of mathematical
statistics. Is there not a point, even if it varies from case to case, at which
the sheer size of a non-random sample reduces its inherent statistical bias
enough for the sample to become practically acceptable and meaningful?
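One way to make this question concrete is to compare two quantities. If a non-random sample covers a fraction f of the population, the population mean equals f times the mean of the covered units plus (1 - f) times the mean of the uncovered units, so the error of the covered mean is exactly (1 - f) times the gap between the two group means. The sketch below sets that error beside the standard error of a simple random sample drawn without replacement. All numbers (population size, standard deviation, group means) are hypothetical, and the gap between covered and uncovered units is assumed to stay fixed as coverage grows; in practice that gap is unknown and may itself change with coverage.

```python
import math


def nonrandom_error(coverage, mean_included, mean_excluded):
    """Error of the included units' mean relative to the population mean.

    Population mean = coverage * mean_included + (1 - coverage) * mean_excluded,
    so the error is (1 - coverage) * (mean_included - mean_excluded).
    """
    return (1.0 - coverage) * (mean_included - mean_excluded)


def srs_standard_error(fraction, population_size, population_sd):
    """Standard error of the sample mean under simple random sampling
    without replacement, including the finite population correction."""
    n = fraction * population_size
    return math.sqrt((1.0 - fraction) / n) * population_sd


# Hypothetical population: 10,000 units, standard deviation 10;
# covered units average 52, uncovered units average 50.
random_se = srs_standard_error(0.05, 10_000, 10.0)
print(f"5% random sample, standard error: {random_se:.3f}")
for coverage in (0.05, 0.50, 0.95):
    error = nonrandom_error(coverage, 52.0, 50.0)
    print(f"{coverage:.0%} non-random sample, error: {error:.3f}")
```

Under these assumed numbers, the 5 percent random sample has a standard error of about 0.44, while the non-random sample is off by 1.9 at 5 percent coverage, 1.0 at 50 percent, and 0.1 at 95 percent. The crossover point depends entirely on the assumed gap, which is the case-by-case variation the question refers to.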
As
another example, as long as Figure 1 remains the typical process of conducting
statistical studies in a sequential and cross-sectional manner, there is little
room for innovative improvement to reduce turnaround time or introduce new
metrics such as measuring longitudinal change at the unit level [12]. Is it absolutely impossible to produce accurate
and reliable statistical results in real time?
Or is it because we have become so comfortable with the present
software, approach, and convenience that there is no desire to consider other
possibilities?
Random
sampling has been the dominant mode of statistical operation for a century [13]. Because of Big Data, one may now study an
entire population almost as easily as one can study a random sample today. Should we ignore this opportunity?
If statisticians do not recognize or embrace the challenges of theory and practice
posed by Big Data as part of the core of studying and practicing statistics, the
risk is high that others, including the yet-undefined “data scientists,” will
fill the void [14].
Among the
many possibilities offered by Big Data is the creation and maintenance of
Dynamic Frames – population frames that are rich in content, capture the most
up-to-date data as soon as they become available, and produce results and
reports according to established schedules or even in real time.
With some user bases exceeding one billion members, e-commerce companies and
social media are well positioned to apply their data from online
transactions, emails, and blog postings to conduct market research and perform
predictive analyses. A lay person may
also capture these data in a less structured manner.
Figure 2 provides a simple schematic of how Dynamic Frames may work. Such frames
are also described as longitudinal data systems
in educational applications in the U.S. [15,16].
In essence, primary efforts are put into the creation and maintenance of the frame
so that it is optimized with respect to the previously identified qualities. It is constantly updated with new data for
every sampling unit over time.
Statisticians must be fully engaged in the design, implementation, and operation of Dynamic
Frames, in addition to the production of descriptive and analytical results. There are many new and traditional functions
to which statisticians can make major contributions.
For example, the identification code is a key to unlocking the enormous power in
Big Data. It controls the extent to which
additional records and data may be linked, determines firsthand the overall
quality of data and study, and is the first safeguard to protect confidentiality.
As another example, the size and content for the units have no conceivable limit. They depend only on the availability of data,
the ability to link and match records, and the design of the system. Effective operation minimizes mismatches of
records and collection of duplicative data that do not change or change in
a predictable manner. Appropriate replacement or imputation for
missing values ensures quality and timely integration of data.
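The ideas in the preceding paragraphs can be sketched as a toy, in-memory structure. This is only an illustration under stated assumptions: the `DynamicFrame` class, its method names, and the sample unit "U001" are hypothetical, updates for a given unit and variable are assumed to arrive in date order, and a real system would also need record linkage, imputation, and confidentiality controls that are omitted here.

```python
class DynamicFrame:
    """Toy frame: unit identifier -> variable name -> dated history of values."""

    def __init__(self):
        self._history = {}

    def _series(self, unit_id, variable):
        # Read-only lookup that does not create empty entries.
        return self._history.get(unit_id, {}).get(variable, [])

    def update(self, unit_id, variable, date, value):
        """Store a new value, skipping it when it repeats the latest one on file.

        Skipping unchanged values avoids collecting duplicative data.
        Returns True if a new entry was stored, False otherwise.
        """
        series = self._history.setdefault(unit_id, {}).setdefault(variable, [])
        if series and series[-1][1] == value:
            return False
        series.append((date, value))
        return True

    def latest(self, unit_id, variable):
        """Most recent value on file, or None if nothing has been recorded."""
        series = self._series(unit_id, variable)
        return series[-1][1] if series else None

    def changes(self, unit_id, variable):
        """Dated history for one unit, for longitudinal analysis at the unit level."""
        return list(self._series(unit_id, variable))


# Illustrative usage with made-up data.
frame = DynamicFrame()
frame.update("U001", "employees", "2013-01-15", 40)
frame.update("U001", "employees", "2013-02-15", 40)  # unchanged, not stored
frame.update("U001", "employees", "2013-03-15", 45)
print(frame.latest("U001", "employees"))
print(frame.changes("U001", "employees"))
```

Keying everything on the unit identifier is what allows new records to be linked to existing ones, and keeping the dated history rather than overwriting values is what makes unit-level longitudinal measures possible.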
Other enhancements
of traditional statistical functions [14] include, but are not limited to, establishing
continuous quality loops back to the data sources; developing new definitions, metrics,
and standards for the dynamic frames; applying new statistical modeling for
imputation, profiling, risk assessment, and creating artificial intelligence;
developing innovative visualizations; improving statistical training and education;
and protecting confidentiality.
Summary
Dynamic frames will retain their original
purpose as a list of known units for conducting censuses and drawing random
samples as needed, but the potential use of structured Big Data is limited only
by the imagination and innovative spirit of the statistics profession. Statisticians need to embrace Big Data as their
own revolution, which will lead to the next level of human knowledge and
practice through the study and use of data.
Co-authored by
Jeremy S. Wu, Ph.D., Jeremy.s.wu@gmail.com
Junchi Guo, Ph.D. Candidate, junchi@email.gwu.edu
References
[1] Hansen, Morris H.; Hurwitz, William N.; and Madow, William G. (1953). Sample Survey Methods and Theory. Wiley Classics Library Edition, John Wiley & Sons, Inc.
[2] Kish, Leslie. (1965). Survey Sampling. Wiley Classics Library Edition, John Wiley & Sons, Inc.
[3] Cochran, William G. (1977). Sampling Techniques. A Wiley Publication in Applied Statistics, Third Edition, John Wiley & Sons, Inc.
[4] Wikipedia. Sampling Frame. Available at http://en.wikipedia.org/wiki/Sampling_frame on April 8, 2013.
[5] Baidu.com. Sampling Frame 抽样框. Available at http://baike.baidu.com/view/1652958.htm on April 8, 2013.
[6] U.S. Census Bureau. Master Address File: Update Methodology and Quality Improvement Program, by Philip M. Ghur, Machell Kindred, and Michael L. Mersch, 1994. Available at https://www.amstat.org/sections/srms/Proceedings/papers/1994_128.pdf on April 8, 2013.
[7] U.S. Census Bureau. The Master Address File for the 2010 Census, by Joseph Salvo, April 7, 2006. Brookings Breakfast Briefings on the Census. Available at http://www.brookings.edu/~/media/events/2006/4/07community%20development/20060407_salvo.pdf on April 8, 2013.
[8] Varian, Hal. Hal Varian explains why statisticians will be the sexy job in the next 10 years, September 15, 2009. YouTube. Available at http://www.youtube.com/watch?v=pi472Mi3VLw on April 8, 2013.
[9] Pierson, Steve and Wasserstein, Ron. Big Data and the Role of Statistics, March 28, 2012. Available at http://community.amstat.org/amstat/blogs/blogviewer?BlogKey=737fd276-0225-4c87-b7cb-0cfc7cd9e124 on April 8, 2013.
[10] van der Laan, Mark; Hsu, Jiann-Ping; and Rose, Sherri. Statistics Ready for a Revolution. Amstat News, September 1, 2010. Available at http://magazine.amstat.org/blog/2010/09/01/statrevolution/ on April 8, 2013.
[11] Washington Post. From the President’s Hand to the Internet. Available at http://www.washingtonpost.com/lifestyle/style/from-the-presidents-hand-to-the-internet/2013/03/21/0b609e66-9282-11e2-9cfd-36d6c9b5d7ad_graphic.html on April 8, 2013.
[12] Diggle, Peter J.; Heagerty, Patrick J.; Liang, Kung-Yee; and Zeger, Scott L. (2001). Analysis of Longitudinal Data. Second Edition, Oxford University Press.
[13] Wu, Jeremy S., Chinese translation by Zhang, Yaoting and Yu, Xiang. One Hundred Years of Sampling, invited paper in Sampling Theory and Practice, ISBN7-5037-1670-3, 1995. China Statistical Publishing Company.
[14] Wu, Jeremy S. 21st Century Statistical Systems, August 1, 2012. Available at http://jeremyswu.blogspot.com/2012/08/abstract-combination-of-traditional.html on April 8, 2013.
[15] Data Quality Campaign. Using Data to Improve Student Achievement. Available at http://www.dataqualitycampaign.org/ on April 8, 2013.
[16] U.S. Department of Education. Statewide Longitudinal Data Systems Grant Program, National Center for Education Statistics. Available at http://nces.ed.gov/programs/slds/ on April 8, 2013.