Big Data promises to improve governance of society and
better inform the public in the 21st century. Although every data record has some
information to contribute, linking and merging relevant electronic records can minimize
the collection of duplicate data and increase the value and utility of the
integrated data rapidly and exponentially. Essential in this approach is the presence of
identification codes that will facilitate the actual integration of records and
data. The identification code is a key
to unlocking the enormous power in Big Data.
However, it may also be the primary cause of system failures, misuses
and abuses, and even fraudulent or criminal activities, if it is not properly
applied and managed. In addition to
technology, statistical design and quality feedback loops, proper education and
training, relevant policies and regulations, and public awareness are all needed
for the effective and responsible use of identification codes and Big Data.
The Need for Identification Codes
When a student enters a school, a record
will store the student’s name, gender, age, family background, field of study,
and other data. When she takes a course
and receives a final grade, the results are recorded. When she satisfies all the requirements for graduation,
another record will show the grade point average she has achieved and the
degree she is awarded.
Each record represents a snapshot for
the student. The records are collected over
time for administrative purposes. Together
the longitudinal snapshots provide comprehensive information about the education
of a student.
When the student enters the workforce,
additional data are collected over her lifetime about the industry and
occupation she works in, the job she performs, the wages and promotions she
receives, the taxes and insurances she pays, and the employment or unemployment
status she is in.
In like manner, massive amounts of data
are collected about a firm, including its initial registration as a business, periodic
reports on revenues and expenses, entry into the stock markets, acquisitions or
mergers with other companies, payment of taxes and fees, growth in sales and
staffing, and expansion or death of the business.
These administrative records used to be
stored in dusty file cabinets, but they have been mostly digitized and made available
for computer processing since the Big Data era arrived at the turn of the
millennium.
Timely and proper integration of the
records of all students would provide unprecedented details about how the
school is performing, such as its graduation or dropout rate over time. A further roll-up across all schools would inform a
nation about its state of education, such as its capacity to support employment
and economic growth. This is what Big
Data promises to bring in the 21st century. From allocation of resources, measurement of
performance, to formulation of policy, every segment of society can benefit
from the details and insights of Big Data to improve governance and inform the
public.
Although every data record has some
information to contribute, linking and merging relevant electronic records minimizes
the collection of duplicate data and increases the value and utility of the
integrated data exponentially. Essential
in this approach is the presence of identification codes that will facilitate
the actual integration of records and data.
Statisticians can make significant contributions to building new
statistical systems with their thinking and methods in this process.
Types of Identification Codes
The name of an individual or a company was
the preferred identification code when files were still physical, such as in
paper form. It has been conventional to
consolidate records under the same name and sort them by alphabetical order in
English, number of strokes in Chinese, or chronological order.
However, a major shortcoming of using
names, especially when processed massively by computer, is that they are not
unique. The top four family names of
Lee, Wang, Zhang, and Liu accounted for 334 million individuals in China in
2006 [1], exceeding the total U.S. population.
Chinese names may also appear differently because of the simplified and
traditional characters. The English first
name of Robert, the 61st most popular male name at birth in the U.S.
in 2011 [2,3], can have at least 7 common variations for the same person,
including Bert, Bo, Bob, Bobby, Rob, Robbie, and Robby. Bert may also be short for Albert. Individuals may apply to change their names
or use more than one name; women may change their names after marriage. Human errors can introduce erroneous names. Matching references to the same name across nations
with different languages can be notoriously difficult.
The name of a company is usually checked
and validated to avoid duplication during the registration process and
protected by applicable local, national and international rules and laws
including trademarks after registration.
The company may use multiple names including abbreviations and stock
market symbols; it can also change its name due to merger with another company,
acquisition agreement, reorganization, or a simple desire to change its brand.
A non-unique identification code poses
the risk of linking and merging records incorrectly, leading to incorrect results
or conclusions. Supplementing a name
with auxiliary information, such as age, gender, and an address, would reduce but
not eliminate the chance of record mismatches, at the cost of increased processing
time.
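The risk can be illustrated with a minimal Python sketch. The records, the `P001`-style codes, and the `group_by` helper below are all made up for illustration; they are not taken from any real system. Grouping by name both merges two different people and splits one person in two, while grouping by a unique code does neither.

```python
from collections import defaultdict

# Hypothetical records: (identification code, name on record, event)
records = [
    ("P001", "Robert Chen", "enrolled"),
    ("P001", "Bob Chen", "graduated"),   # same person, nickname on a later record
    ("P002", "Robert Chen", "enrolled"), # different person, same name
]

def group_by(records, key_index):
    """Collect the events of all records that share the same key."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key_index]].append(record[2])
    return dict(groups)

by_name = group_by(records, 1)  # two people merged under "Robert Chen"
by_id = group_by(records, 0)    # one history per person
```

In this toy example, `by_name` reports a "Robert Chen" who enrolled twice and a "Bob Chen" who graduated without ever enrolling, while `by_id` recovers the two actual histories.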
A series of numbers, letters and special
characters (alphanumeric) or a series of numbers alone (numeric) is
increasingly used as the identification code of choice with modern machine sorting,
linking, and merging of electronic records.
Numeric codes tend to be less restrictive because they are independent of
the writing system. Alphanumeric codes
using letters from the English alphabet may be suitable for systems using
languages based on the Latin alphabet, but systems using non-Latin scripts may still
find them unavailable or difficult to use, understand, or interpret. It is also easier to understand how numeric
codes are sorted compared to alphanumeric codes.
When the Social Security Act of 1935 was
passed in the U.S., one of the first challenges in implementation was to create
an identification code that would “permanently identify each individual to be
covered” and “be sufficiently elastic to function indefinitely as additional
workers became covered” [4]. An 8-field
alphanumeric code was initially chosen, but it was soon rejected by the
statistical agencies, as well as labor and justice departments. This exchange was described [4,5] as the
first sign of “the tremendous impact machines would have on the way
[government] would do business.” This
was before computers were introduced for actual use.
Today, the impact of information
technology is obvious and continues to increase in every aspect of government,
business, and individual activities. An
identification code may be applied to a person, a company, a vehicle, a credit
card, a cargo shipment, an email account, a location, or just about any practical entity.
An electronic record that does not
contain an identification code or cannot be correctly linked with other records
may be described as lacking in “structure” or “unstructured” in the Big Data
era. Since the beginning of the 21st
century, “unstructured” data have been generated in far greater volume than
“structured” data. However, they contain
relatively limited information content and utility compared to “structured”
data, especially for continuing, consistent, and reliable information about a
society or an economy over time.
Effective use of the identification code
is a key to unlocking the enormous power inherent in Big Data.
Effective Use of Identification Codes
1. Match and Merge Records. Ideal identification codes are mutually
exclusive and exhaustive, establishing an unambiguous one-to-one relationship
between the code and the entity, including those yet to appear in the future. The code facilitates direct and perfect machine
sorting, matching, and merging of electronic records, potentially increasing
the amount of information about the entity with no limit.
2. Anonymize and Protect Identity. A code offers the first-line protection of identity by
anonymizing the entity. Due to the
increasing importance of the code and the relative ease of linking with other
records, the risks and stakes of identity fraud or theft through the
identification code have also risen, requiring responsible policy and
management of the code as safeguards.
3. Provide Basic Description and Classification. An
identification code can provide the most basic description of the content and
context of the data records, from which simple observations or summaries can be
quickly derived. Over time, this concept
also evolved into codes for classification and the separate development of
“metadata” [6,7] for efficiently building structure into data systems and
broadening their use across systems.
4. Perform Initial Quality Check. Unintentional human errors in typing or transcribing
an identification code incorrectly may damage the quality of integrated data
and the eventual analytical results. Fraudulent
or malicious altering of the identification codes may inflict even more severe
damage to the integrity and reliability of the data. Early detection through a “check
digit” [8,9] embedded in the identification code may eliminate more than 90 percent of
these common errors.
5. Facilitate Statistical Innovations. By collecting
and integrating data continuously for each entity such as a student, a dynamic
frame with rich content can be built for all students and all schools. New data elements may be defined for
analysis; statistical summaries may be produced in real time or according to
set schedules to describe the performance of a school or the state of education
for a nation, while strictly protecting the confidentiality of individuals and security
of their data. Innovative efforts to
construct these dynamic frames, or longitudinal data systems, have started in
the U.S. and China [10]. The Data
Quality Campaign [11] lists “a unique statewide student identifier that
connects student data across key databases across years” to be the top
essential element in building state longitudinal data systems for education in
the U.S.
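The matching and merging in item 1, and the kind of summary mentioned in item 5, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two files, the `S001`-style student codes, the field names, and the `merge_by_id` helper are hypothetical, and a real longitudinal data system would add confidentiality protection and quality checks.

```python
# Hypothetical administrative files, each keyed by a student identification code
enrollment = {
    "S001": {"entry_year": 2008},
    "S002": {"entry_year": 2008},
    "S003": {"entry_year": 2009},
}
graduation = {
    "S001": {"degree": "BS", "gpa": 3.4},
    "S003": {"degree": "BA", "gpa": 3.1},
}

def merge_by_id(*files):
    """Combine the fields of every record that carries the same code."""
    merged = {}
    for file in files:
        for code, fields in file.items():
            merged.setdefault(code, {}).update(fields)
    return merged

students = merge_by_id(enrollment, graduation)

# A simple summary derived from the integrated records
graduated = sum(1 for fields in students.values() if "degree" in fields)
graduation_rate = graduated / len(students)
```

Because the code is the only join key, adding a third file (for example, employment records) would require no change to `merge_by_id`.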
Personal Identification Codes of the U.S. and China
The U.S. does not have a national
identification system. The Social Security Number (SSN) was created to track
earnings of workers in the U.S. in 1936, before computers were introduced for commercial
use. Its transition into the computer
age revealed some of the strengths and weaknesses of its evolving role as an
identification code.
The 9-digit SSN is composed of 3 parts:
- Area Number (3 digits) – initially geographical region where the SSN was issued and later the postal area code of the mailing address in the application
- Group Number (2 digits) – representing each set of SSN being assigned as a group
- Serial number (4 digits) – from 0001 to 9999
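As a rough sketch, the three parts can be separated in Python as follows. The `parse_ssn` function and the sample number are hypothetical, and the hyphenated AAA-GG-SSSS layout is assumed here as the conventional written form. Note that the function can only check the structure of the number; it cannot tell whether a digit was mistyped.

```python
import re

# Assumed written form: 9 digits, optionally hyphenated as AAA-GG-SSSS
SSN_PATTERN = re.compile(r"([0-9]{3})-?([0-9]{2})-?([0-9]{4})")

def parse_ssn(text):
    """Split a 9-digit SSN into its area, group, and serial parts."""
    match = SSN_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError("expected 9 digits, optionally written as AAA-GG-SSSS")
    area, group, serial = match.groups()
    return {"area": area, "group": group, "serial": serial}

parts = parse_ssn("123-45-6789")  # made-up number for illustration only
```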
Demographic data are collected in the
SSN application [12], including name, place of birth, date of birth,
citizenship, race, ethnicity, gender, parents’ name and SSN, phone number and
mailing address. The U.S. Social
Security Administration is responsible for issuing the SSN. Some of the SSN are reserved and not
used. Once issued, an SSN is supposed to be
unique because it would not be issued again.
However, some cases of duplicate use exist.
A wallet manufacturer decided to promote
its product in 1938 by showing how a copy of a Social Security card from one of
its employees would fit into its wallets, which were sold through department
stores [13]. In all, over 40,000 people mistakenly
reported this to be their own SSN, with some as late as in 1977.
Use of the SSN by the government and later
the private sector has expanded substantially since its creation. Beginning in 1943, federal agencies were
required by executive order to use the SSN whenever the agency finds it
advisable to establish a new system of permanent account numbers for
individuals [5]. In the early 1960s,
federal employees and individual tax filers were required to use SSN. In the late 1960s, SSN began to serve as
military identification numbers.
Throughout the 1970s when computers were increasingly used, SSN was
required for federal benefits and financial transactions such as opening bank accounts
and applying for credit cards and loans. Beginning in 1986, parents were required to list the SSN
of each dependent they wanted to claim as a tax deduction. This anti-fraud change resulted in 7 million
fewer minor dependents being claimed in the first year of implementation [14].
As the SSN became essentially an unofficial
national identifier that can link and merge many electronic files for the same
person, it can also be the direct cause of misuse and abuse such as
identity theft [15]. The SSN does not
have a check digit; it cannot be used reliably for authentication of identity. Academic researchers have also demonstrated
ways to use publicly available information “to reconstruct SSN with a startling
degree of accuracy” [16]. These identified
vulnerabilities have resulted in more cautious, secure, and responsible use of
the SSN in the U.S. in recent years. The
original 1943 executive order requiring the use of SSN was rescinded and replaced
by another executive order in 2008 that makes the use of SSN optional.
China had a relatively late start in
personal identification codes. It
revised the Resident Identification Number (RIN) from 15 digits to 18 digits on
July 1, 1999, raising the embedded birth year from 2 to 4 digits and adding a
check digit. The 18-digit RIN is
composed of 4 parts [17,18]:
- Address Area Number (6 digits) – administrative code for the individual’s residence
- Birthdate Number (8 digits) – in the form of YYYYMMDD where YYYY is year, MM is month and DD is day of the birthdate
- Serial Number (3 digits) – with odd numbers reserved for males and even numbers reserved for females
- Check Digit (1 digit) – computed digit based on 17 previous digits using the ISO 7064 standard algorithm [18,19]
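The four parts and the check digit can be sketched in Python as follows. This is a minimal illustration, assuming the MOD 11-2 variant of ISO 7064, in which a remainder of 10 is written as the letter X; the function names are hypothetical, and the sample number is a widely circulated illustrative example rather than a real person's RIN.

```python
from datetime import date

# MOD 11-2 weights for the first 17 digits: 2**(17 - i) mod 11, left to right
WEIGHTS = [pow(2, 17 - i, 11) for i in range(17)]
# Check character indexed by (weighted sum mod 11); "X" stands for the value 10
CHECK_CHARS = "10X98765432"

def rin_check_char(first17):
    """Compute the check character from the first 17 digits."""
    total = sum(int(digit) * weight for digit, weight in zip(first17, WEIGHTS))
    return CHECK_CHARS[total % 11]

def parse_rin(rin):
    """Validate an 18-character RIN and split it into its four parts."""
    rin = rin.strip().upper()
    body = rin[:17]
    if len(rin) != 18 or not (body.isascii() and body.isdigit()):
        raise ValueError("expected 17 digits followed by a check character")
    if rin_check_char(body) != rin[17]:
        raise ValueError("check digit mismatch")
    return {
        "area": rin[:6],
        "birthdate": date(int(rin[6:10]), int(rin[10:12]), int(rin[12:14])),
        "serial": rin[14:17],
        # Odd serial numbers are reserved for males, even for females
        "sex": "male" if int(rin[14:17]) % 2 else "female",
    }

info = parse_rin("11010519491231002X")
```

Changing any single digit of the sample, for instance the serial number from 002 to 003, makes `parse_rin` raise a `ValueError`, which is the early quality check that a check digit is meant to provide.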
Security offices at county-level local
governments issue the resident identification cards to individuals upon
application no later than age 16. Data
collected include name, gender, race, birthdate, and residential address. The resident identification cards may be
valid permanently or for a time period as short as 5 years, depending on the
age of the applicant. According to official
announcements, the RIN is also used to track individual health records in the
National Electronic Health Record System in China [20].
Business Identification and Industry Classification Codes of the U.S. and China
An Employer Identification Number (EIN)
is to a business what the SSN is to an individual in the U.S. [21]. However, a business in this case may also be a
local, state, or federal government; it may also be a company without employees
or an individual who has to pay withholding taxes on his/her employees. The EIN is a unique 9-digit number assigned
by the U.S. Internal Revenue Service (IRS) according to the GG-NNNNNNN format,
where GG was a numerical geographical code for the location of the business
prior to 2001 and the remaining 7 numeric digits have no special meaning. Once issued, an EIN will not be reissued by
IRS. In addition, each state has its own,
different Employer Identification Number for its tax collection and
administrative purposes.
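A structural check of the GG-NNNNNNN format might look like the Python sketch below. The `parse_ein` function and the sample number are hypothetical. Like the SSN, the EIN described here carries no check digit, so the sketch can only confirm the layout, not the correctness of the number.

```python
import re

# GG-NNNNNNN: a 2-digit prefix, a hyphen, and 7 further digits
EIN_PATTERN = re.compile(r"([0-9]{2})-([0-9]{7})")

def parse_ein(text):
    """Split a federal EIN written as GG-NNNNNNN into its two parts."""
    match = EIN_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError("expected the form GG-NNNNNNN")
    prefix, remainder = match.groups()
    return {"prefix": prefix, "remainder": remainder}

ein = parse_ein("12-3456789")  # made-up number for illustration only
```

Both the EIN and the SSN have 9 digits, so once the hyphens are stripped the two cannot be told apart by format alone; a system that stores both would need to record which kind of code each value is.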
Information collected about the business
during the EIN application process includes legal name, trade name, executor
name, responsible party name, mailing address, location of principal business,
type of entity or company, reason for application, starting date of business,
accounting year, highest number of employees expected in the next 12 months,
first date of paid wages, and principal activity of business [22].
U.S. statistical agencies use the North
American Industry Classification System (NAICS) to classify business
establishments for the purpose of collecting, analyzing, and publishing
statistical data related to the U.S. economy [23]. NAICS was adopted in 1997, replacing the Standard
Industrial Classification (SIC) system.
NAICS is a hierarchical classification coding
system whose codes are 2 to 6 numeric digits long. The top 2-digit codes represent the major economic
sectors such as Construction and Manufacturing. Each 2-digit sector contains a collection of
3-digit subsectors, each of which in turn contains a collection of 4-digit
industry groups. For example, 31-33 is
the Manufacturing sector for which the following hierarchy exists for the Rice
Milling industry:
311 Food Manufacturing
3112 Grain and Oilseed Milling
31121 Flour Milling and Malt Manufacturing
311212 Rice Milling
One of the strengths of the hierarchical
system is that aggregation can be performed easily up the chain. For example, the sum of all 311X industry groups should
form the 311 Food Manufacturing subsector in the U.S.
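The aggregation up the chain amounts to truncating each code to the desired number of digits and summing. The Python sketch below shows the idea; the establishment counts are invented for illustration and the `roll_up` helper is hypothetical.

```python
from collections import defaultdict

# Invented establishment counts keyed by 6-digit NAICS code
counts = {
    "311211": 120,  # Flour Milling
    "311212": 45,   # Rice Milling
    "311230": 30,   # Breakfast Cereal Manufacturing
    "311811": 200,  # Retail Bakeries
}

def roll_up(counts, digits):
    """Aggregate counts to the level given by the first `digits` digits."""
    totals = defaultdict(int)
    for code, n in counts.items():
        totals[code[:digits]] += n
    return dict(totals)

industry_groups = roll_up(counts, 4)  # 4-digit industry groups
subsectors = roll_up(counts, 3)       # 3-digit subsectors

# Caveat: truncating to 2 digits yields "31", while the Manufacturing sector
# spans 31-33, so the sector level needs an extra mapping step.
```

With these made-up numbers, the 4-digit groups 3112 and 3118 hold 195 and 200 establishments, and they add up to 395 for subsector 311.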
Consistent creation and assignment of
NAICS codes is a challenge in a global, dynamic economy where obsolete
industries may disappear and new industries may spawn and grow overnight. Examples of challenging industries include
“high technology” industries in the past and the recent “green”
industries. Application of the NAICS codes
is subject to interpretation and consistency issues. For example, the U.S. Census Bureau and the
U.S. Bureau of Labor Statistics disagree in creating and maintaining their
respective business frames due to differences in data sources and assignment of
NAICS codes [10]. Inconsistent use of
NAICS codes disrupts or even invalidates analysis and interpretation of time
series or longitudinal data.
A new business in China must apply to
the local Quality and Technical Supervision Office for a 9-digit National Organization
Code, which contains 8 digits and 1 check digit [24]. The Chinese regulation, GB 11714-1997 on
Rules of Coding for the Representation of Organizations, is patterned after
the international standard ISO 6523 Information Technology – Structure for the
Identification of Organizations and Organization Parts [25]. Online directories exist to look up
information about the organization based on the National Organization Code [26].
The value of the Chinese Industrial
Statistical Dataset is well recognized by economists and other analysts
domestically and internationally.
Substantial resources were invested into the construction and
maintenance of the comprehensive data system that describes almost all
state-owned and large enterprises (annual sales of over RMB ¥5 million until
2010 and over RMB ¥20 million thereafter) in China longitudinally since
1998. However, serious data quality
problems have been reported, and the primary cause can be traced to the
inconsistent and incorrect application of the identification codes [27]. This situation persists even though China started
its standardization of organization codes in 1989 and is currently in the third
phase of implementation [28].
As recently as last month, Guangdong
province announced its commitment to use a shared platform for the National
Organization Codes as part of its campaign to combat corruption [29].
China also has a standard industry
classification system under GB/T 4754-2002 [30].
The hierarchical system has 4 levels, with the highest level
indicated by a single letter and the lower levels represented by 2, 3, and 4
digits respectively. For the previous
example of Rice Milling, the Chinese classification system provides the
following hierarchy:
C Manufacturing
C13 Food Manufacturing
C131 Grain Milling
C1312 Rice Milling
Summary
As technology continues to evolve and
grow, larger amounts of digitized data will be collected more rapidly at
relatively low cost. This
characterizes the Big Data era.
These Big Data contain an unprecedented
amount of information. If integrated and
structured, their value and power will increase exponentially beyond what any
existing statistical system has been able to provide. Identification codes that facilitate linking
and merging of records hold the key to unlocking this enormous trove of
opportunities.
As the gateway to the enormous power of
Big Data, identification codes may also be the primary cause of system failures,
misuses and abuses, and even fraudulent or criminal activities, if they are not
properly applied and managed.
The practical challenges of applying an
identification code are complex. In
addition to technology, statistical design and quality feedback loops, proper
education and training, effective policies and regulations, and public
awareness are all needed for the effective and responsible use of
identification codes. These topics will
be discussed in future papers.
References
[1] 360doc.com. Quantitative Ranking of Chinese Family Names
(中國姓氏人口數量),
November 25, 2012. Available at http://www.360doc.com/content/12/1125/17/6264479_250155720.shtml on April 29, 2013.
[3] U.S. Social Security
Administration. Change in Name Popularity. Available at http://www.ssa.gov/OACT/babynames/rankchange.html on April 29, 2013.
[4] U.S. Social Security Administration. Fifty Years of Operations in the Social
Security Administration, by Michael A. Cronin, June 1985. Social Security Bulletin, Volume 48, Number
6. Available at http://www.ssa.gov/history///cronin.html on April 29, 2013.
[5] U.S. Social Security
Administration. The Story of the Social Security
Number, by Carolyn Puckett, 2009.
Social Security Bulletin, Volume 69, Number 2. Available at http://www.ssa.gov/policy/docs/ssb/v69n2/v69n2p55.html on April 29, 2013.
[7] Wikipedia. Metadata (元数据). Available at http://zh.wikipedia.org/wiki/%E5%85%83%E6%95%B0%E6%8D%AE on April 29, 2013.
[8] Wikipedia. Check
Digit. Available at http://en.wikipedia.org/wiki/Check_digit on April 29, 2013.
[9] Wikipedia.
Check Digit (校验码). Available at http://zh.wikipedia.org/wiki/%E6%A0%A1%E9%AA%8C%E7%A0%81 on April 29, 2013.
[10] Wu, Jeremy S. 21st Century Statistical Systems, August 1, 2012. Available at http://jeremyswu.blogspot.com/2012/08/abstract-combination-of-traditional.html on April 29, 2013.
[11] Data Quality Campaign. 10 Essential Elements of a State Longitudinal Data System. Available at http://www.dataqualitycampaign.org/build/elements/1 on April 29, 2013.
[12] U.S. Social Security
Administration. Application for a Social Security
Card, Form SS-5. Available at http://www.ssa.gov/online/ss-5.pdf on April 29, 2013.
[13] U.S. Social Security Administration. Social Security Cards Issued by Woolworth. Available at http://www.socialsecurity.gov/history/ssn/misused.html on April 29, 2013.
[14] Wikipedia. Social Security Number. Available at http://en.wikipedia.org/wiki/Social_Security_number on April 29, 2013.
[15] President’s Identity Theft Task
Force. 2007. Combating Identity Theft: A Strategic Plan. Available at http://www.idtheft.gov/reports/StrategicPlan.pdf on April 29, 2013.
[16] Timmer, John. New Algorithm Guesses SSNs Using Data and Place of Birth, July 6, 2009. Available at http://arstechnica.com/science/2009/07/social-insecurity-numbers-open-to-hacking/ on April 29, 2013.
[17] baidu.com. GB11643-1999 Citizen Identity Number 公民身份号码. Available at
http://wenku.baidu.com/view/4f19376348d7c1c708a14587.html on April 29, 2013.
[18] Wikipedia. Resident Identity Card. Available at http://en.wikipedia.org/wiki/Resident_Identity_Card_%28PRC%29 on April 29, 2013.
[19] Wikipedia. ISO 7064. Available at http://en.wikipedia.org/wiki/ISO_7064:1983 on April 29, 2013.
[20] baidu.com. Electronic Health Record 电子健康档案. Available at
http://wenku.baidu.com/view/348d5a18a300a6c30c229fec.html on April 29, 2013.
[21] Wikipedia. Employer Identification Number. Available at http://en.wikipedia.org/wiki/Employer_identification_number on April 29, 2013.
[22] U.S. Internal Revenue Service. Form SS-4: Application for Employer
Identification Number. Available
at http://www.irs.gov/pub/irs-pdf/fss4.pdf on April 29, 2013.
[23] U.S. Census Bureau. North American Industry Classification
System. Available at http://www.census.gov/eos/www/naics/index.html on April 29, 2013.
[24] National Administration for Code
Allocation to Organizations. Introduction
to Organizational Codes, 组织机构代码简介. Available at http://www.nacao.org.cn/publish/main/65/index.html on April 29, 2013.
[26] National Administration for Code
Allocation to Organizations. National
Organization Code Information Retrieval System, 全国组织机构信息核查系统. Available
at http://www.nacao.org.cn/ on
April 29, 2013.
[27] Nie, Huihua; Jiang, Ting; and
Yang, Rudai. A Review and Reflection on the
Use and Abuse of Chinese Industrial Enterprises Database. World Economics, Volume 5, 2012. Available at http://www.niehuihua.com/UploadFile/ea_201251019517.pdf on April 29, 2013.
[28] National Administration for Code
Allocation to Organizations. Historical
Development of National Organization Codes, 全国组织机构代码发展历程. Available
at http://www.nacao.org.cn/publish/main/236/index.html on
April 29, 2013.
[29] National Administration for Code
Allocation to Organizations. Guangdong
Aggressively Promotes the Use of Identification Codes in its Campaign against
Corruption, 广东积极发挥代码在反腐倡廉中的促进作用, March 7, 2013. Available
at http://www.nacao.org.cn/publish/main/13/2013/20130307150216299954995/20130307150216299954995_.html on April 29, 2013.