Demography to advance with data science in new era
Passengers have their tickets checked at Fuzhou Railway Station in Fuzhou, capital of southeast China’s Fujian Province, on Jan. 24, 2016. Big data offers a chance for demography to map out the route of population flow and demographic dynamics.
In recent years, the emergence of big data has created new possibilities for research in a variety of fields. However, demography, a discipline that relies heavily on data, has been slow to take advantage of these opportunities. Though some scholars have published research based on household registration, marriage registration, cell phone signaling data, light remote sensing data and Baidu location data, few of these studies fall under the purview of demography.
Stress on accuracy of data
In general, the status quo is related to three factors: the availability of big data, the focus of demography and the prior training of demography scholars.
To start with, demography has stringent standards for data accuracy, and in most cases big data cannot meet them. For example, Chinese scholars have been debating the fertility level in China for more than 20 years. They still find it hard to reach a consensus on the total fertility rate and continue to argue over one or two decimal places.
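To see why small data errors matter here, recall that the total fertility rate is conventionally computed as the sum of age-specific fertility rates multiplied by the width of each age group. The sketch below uses purely hypothetical rates (they are not Chinese statistics) to show how a modest change in two age groups shifts the result at the first decimal place.

```python
def total_fertility_rate(asfr, interval_width=5):
    """Sum age-specific fertility rates, scaled by the width of each age group."""
    return interval_width * sum(asfr.values())

# Hypothetical age-specific fertility rates (births per woman per year).
asfr_reported = {
    "15-19": 0.006, "20-24": 0.070, "25-29": 0.100, "30-34": 0.060,
    "35-39": 0.020, "40-44": 0.005, "45-49": 0.001,
}

# Same rates, but assuming births in two age groups were underreported.
asfr_adjusted = dict(asfr_reported)
asfr_adjusted["20-24"] = 0.080
asfr_adjusted["25-29"] = 0.120

tfr_reported = total_fertility_rate(asfr_reported)
tfr_adjusted = total_fertility_rate(asfr_adjusted)
print(round(tfr_reported, 2), round(tfr_adjusted, 2))
```

With these assumed inputs, the two scenarios differ by about 0.15 children per woman, which is the scale of disagreement the debate turns on.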
Second, it is true that in the past decade more and more information has been digitized or quantified, including census, household registration, marriage registration, birth, education, medical care, transportation, insurance, real estate and banking records, but there is little chance that this personal information will become accessible to the public.
The heterogeneous data and fragmented texts, pictures and videos gathered by internet companies are limited in amount and often lack basic sociodemographic variables, representativeness and accuracy. Thus, such data cannot be used to precisely estimate population size, structure and migration.
Lastly, scholars in the field have long been accustomed to relying on the aggregated data and large-scale sample data published by authoritative departments, such as statistics bureaus and health and family planning commissions. Compared with sociologists, most demography scholars are adept at handling structured data that has few variables and a simple structure, while their experience in gathering and processing unstructured data, such as interview texts, is largely insufficient.
Also, demography scholars are skilled at period and cohort analysis, which is built on data for different age groups and requires each age group to be independently representative, setting a higher bar for sample size.
More often than not, demographic research draws on national and regional censuses and population migration data from statistical bureaus; population dynamic monitoring data and summary statistics on fertility, education, health and household registration from health and family planning offices; and other small-scale sample surveys organized by individual government units.
However, when it comes to spatial and network information in the big data era, demography scholars lack both the understanding and the processing capacity, leaving geographers to fill the gap in these areas of research.
Main sources of big data
Nowadays, big data can be grouped into two main categories. The first consists of basic registration records from various government units and public sectors. This data usually contains abundant demographic and social attributes. If it becomes available, it would greatly help demography scholars to better study birth and death, migration and behavioral activities, and also to narrow the research focus, so as to achieve a breakthrough in the long-standing tension between sophistication and scope.
However, at the moment, this data is only accessible to a small group of scholars. It is a waste to let this data sit there and become obsolete.
The second is the new type of data generated by portable smart devices, such as internet trace data, GPS positioning and cell phone signaling, which indicates the dynamic space-time location and behavioral information of the population. However, it cannot easily be matched accurately with basic demographic and social information.
Further study of this type of data can help demography scholars grasp the population distribution and flow within a certain space-time range and enhance their understanding of demographic dynamics, although these topics often go beyond traditional demography. At present, it is fairly difficult to carry out in-depth interdisciplinary analyses of this data, which is why demography scholars should face these challenges together with scholars from other fields.
Equipped with theoretical knowledge and skills, demography scholars must strive to get their hands on these two types of big data through cooperation with the public and private sectors. This requires them to learn to use large-scale databases and to extract and process new data, including methods for matching various kinds of unstructured data with basic data.
For example, matching mobile phone numbers and device identification numbers with the age, gender, household registration place and birthplace information extracted from identity cards is a basic skill. Also, they should learn to use electronic trace data to identify users' gender, age, occupation, family structure, permanent residence, workplace and other labeling information.
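The matching step described above can be illustrated with a minimal sketch. It assumes both sources share a pseudonymized phone key; every field name and record here is hypothetical, and real linkage would also have to handle privacy rules, duplicates and inexact keys.

```python
def match_records(device_records, registry):
    """Attach registry attributes to each device record with a matching phone key."""
    matched, unmatched = [], []
    for record in device_records:
        attrs = registry.get(record["phone"])
        if attrs is None:
            unmatched.append(record)  # no basic demographic data for this key
        else:
            matched.append({**record, **attrs})
    return matched, unmatched

# Hypothetical registry keyed by a pseudonymized phone identifier.
registry = {
    "p001": {"age": 34, "gender": "F", "registration_place": "Fuzhou"},
    "p002": {"age": 61, "gender": "M", "registration_place": "Ningbo"},
}

# Hypothetical device-side records with location information only.
device_records = [
    {"phone": "p001", "device_id": "d-17", "observed_city": "Beijing"},
    {"phone": "p003", "device_id": "d-42", "observed_city": "Beijing"},
]

matched, unmatched = match_records(device_records, registry)
match_rate = len(matched) / len(device_records)
print(match_rate, matched)
```

Reporting the match rate alongside the matched records is a deliberate choice: unmatched records are one source of the representativeness problems discussed earlier.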
As companies enhance awareness of data assets and the public yearns for privacy protection, demography scholars must take on an advantageous position in the division of labor to get more access to this data.
Some scholars have been collaborating with the Information Center of Beijing Civil Affairs Bureau to analyze the pattern of marriage as well as population structure and dynamics through the marriage registration data, offering a window through which to learn about the evolution of the registered population and permanent residents in Beijing. This result could serve as the evidence for further population regulation and policy adjustment.
In this light, it is still worthwhile to explore how to transform individual resources and social networking into the advantages of the discipline.
In practice, it might be difficult to work with national bureaus. Scholars could try regional units and departments to focus on a certain area. In the past, due to low local economic strength and poor database construction, many scholars had to use national data for local studies.
Now, it is quite a different story. As regional competition becomes more intense, many local governments have recognized the importance of population resources and human capital, so they want to learn about the heterogeneity of the population and its impact, thus providing a greater basis for demographic research.
Ride with advantage, innovation
Going forward, demography scholars need to fully utilize the discipline’s advantages with the central task of promoting the development and use of big data: One is to provide authoritative basic data for big data calibration, and the other is to introduce mature demography theories and methods.
In this regard, Baidu Huiyan, a data analysis platform, set a good example by using basic household registration data to calibrate small-scale population estimates, derived from location mapping data, for the Ningbo and Hangzhouwan regions of Zhejiang Province. It has shed some light on the representativeness and accuracy of new data and provided a basis for the future development of new data.
The basic techniques of demography include choosing the measurement or summary index, determining the model and setting the parameters, which could all be of great help to big data analysis. Demography scholars should let relevant parties, especially the data holders, come to know the value of the discipline.
Finally, innovative thinking is essential for demography scholars. Though the current big data and new data lack representativeness and micro-level accuracy, they are usually timely and have a massive sample base, so they offer validity and reliability for area-level or population-level summary indexes. With these indexes, the structural characteristics and changing patterns of the sample can be discovered.
For example, in traditional demographic data, spatial data is scarce and it is difficult to visualize the spatial distribution of a population. Today, mobile phones and smart devices provide fairly accurate demographic information, offering a better estimation of spatial distribution, variability and demographic composition of certain populations in a given space.
It is worth noting that not everyone uses cell phones and smart devices, and the data may have structural biases, such as underrepresentation of older people and children, but it is still an important reference. Through proper calibration, the accuracy can be further improved. This requires scholars in demography to acquire the relevant analytical and calibration techniques.
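One common calibration technique for this kind of structural bias is post-stratification, in which each age group in the device sample is reweighted to its share in an authoritative source such as a census. The sketch below is only an illustration under assumed numbers: the age groups, counts, population shares and mobility rates are all hypothetical.

```python
def poststratification_weights(sample_counts, population_shares):
    """Weight for each group = population share / sample share."""
    total = sum(sample_counts.values())
    return {
        group: population_shares[group] / (count / total)
        for group, count in sample_counts.items()
    }

def weighted_mean(group_values, sample_counts, weights):
    """Mean of a group-level indicator after applying group weights."""
    num = sum(group_values[g] * sample_counts[g] * weights[g] for g in sample_counts)
    den = sum(sample_counts[g] * weights[g] for g in sample_counts)
    return num / den

# Hypothetical device sample: children and older people are underrepresented.
sample_counts = {"0-14": 50, "15-59": 800, "60+": 150}
# Hypothetical age structure from an authoritative source (shares sum to 1).
population_shares = {"0-14": 0.18, "15-59": 0.64, "60+": 0.18}
# Hypothetical share of each group observed travelling between districts.
mobility_rate = {"0-14": 0.10, "15-59": 0.40, "60+": 0.20}

weights = poststratification_weights(sample_counts, population_shares)
raw = weighted_mean(mobility_rate, sample_counts, {g: 1.0 for g in sample_counts})
calibrated = weighted_mean(mobility_rate, sample_counts, weights)
print(round(raw, 3), round(calibrated, 3))
```

In this made-up case the uncalibrated figure overstates mobility because working-age users dominate the sample. Reweighting corrects only for the age structure; it does not fix differences between device users and non-users within the same age group.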
All in all, demography scholars have a long way to go to find a place for themselves in the era of big data. A first step could be to start with regional and specific projects to further cooperation and collaboration with other disciplines and with the public and private sectors.
Li Ding is from the National Academy of Development and Strategy at Renmin University of China.