Corpora, audio databases open new doors for linguists

By Zhang Jie and Zhang Qingli / 07-26-2013 / Chinese Social Sciences Today

The year 2013 marks the fifth anniversary of the "Audio Database Project for China's Languages and Linguistic Resources", which was launched in 2008.

This year marks the fifth anniversary of the “Audio Database Project for China’s Languages and Linguistic Resources”. On June 5th, the Ministry of Education and the National Languages Committee jointly issued the 2012 Status Report on China’s Language. The report appraised China’s language resource construction as having entered a state of steady progress. Academic circles are responsible in no small part for this development, with institutions like the Chinese Academy of Social Sciences (CASS) and universities like Communication University of China and Beijing Language and Culture University providing intellectual support for the project.

China highly values endangered language protection

According to statistics from a 2008 survey by UNESCO, more than 7,000 languages are spoken around the world today. Experts have estimated that more than 50% of these languages are endangered.

“China has more than 100 languages, 30 written languages and numerous dialects,” said Yi Jun, the director of the Office of Coordination of the Department of Language Information Management at the Ministry of Education. “These languages and dialects embody the ethnic and regional history of the areas they originate from and are used in, becoming vehicles for the transmission of rich and varied cultural information.”

The importance of language protection and development has not been lost on the Chinese government, which listed “the scientific protection of ethnic languages and scripts” as one of seven major issues on the country’s overall language agenda through 2020 and beyond.

“Using technology to preserve Chinese languages and spoken dialects is an integral measure for achieving the scientific protection of ethnic languages and scripts,” Yi said. “The Audio Database Project for China’s Language Resources is itself this kind of measure; its objective is to investigate, organize, study and develop those endangered dialects.”

In 2004, the Department of Language Information Management at the Ministry of Education co-established the National Language Resource Monitoring and Research Center with Beijing Language and Culture University, Central China Normal University, Xiamen University, Jinan University, Communication University of China and Minzu University of China. The research center is a hub for the broad-based construction of language resources. It covers all the main ways languages are used and is equipped with a system to monitor the vital signs of different languages. Each year it issues The Annual Status Report on Languages to keep the public informed about China’s language situation and aware of the value of Chinese language resources, and to facilitate the construction of a harmonious language situation in China.

Gu Yueguo, the research director of the Office of Contemporary Linguistics of the Institute of Linguistics at the CASS, praised his office’s construction of the “Multimodal Corpus of Impromptu Discourse” (a language database containing records of speech and conversation in different media), pointing out that it enables closer study of impromptu discourse, an area that has traditionally been hard for linguistics researchers to access.

“The development of modern technologies, especially audio and video recording equipment, makes us capable of sampling and permanently preserving impromptu discourse in its original conditions. We are able to observe the linguistic factors and cultural contents,” Gu said. “Since they have become digitized, audio and video recordings have become much more manipulable and precise. A one-second video can be divided into 20 to 30 frames, and a one-second audio recording can be broken down into 1,000 bits.”

Modern technologies help researchers observe interpersonal communication both visually and aurally, so their analysis can extend to both verbal and non-verbal information. This enriched analytical framework has enabled them to study the relation between language and its social and cultural circumstances.

Detailing some of the developments at Communication University of China, Hou Min, a professor at the university’s National Broadcast Media Language Resources Monitoring and Research Center, said they proposed the idea for a language monitoring system, which defines a series of requirements for corpus statistics and other computational linguistic methods used in language monitoring. They have also developed an automatic word segment labeling system, which can search for and automatically detect buzzwords, new words and alphabetical words. The improvement in technology has greatly expanded the scope of language monitoring, Hou said. Before, systems could monitor only particular words; now they can monitor media topics, popular news and public sentiment.

Demand for data in the age of Big Data

Language is not a homogeneous system; it is intrinsically varied and constantly evolving. Linguistic elements and the cultural contents they contain can be washed away in a relatively short period of time.

By preserving linguistic elements, the construction of language resources has made it possible to preserve language structure and to carry out comparative studies of language, said Sheng Yulin, a member of the team working on the Audio Database for China’s Language Resources and a professor at the School of Literature and Journalism at Shandong University. Sheng noted that prior studies in linguistics tended to be “mouth to ear”, and scholars worked in the dark as to what other scholars were researching. The construction of audio corpora has changed that, however, enabling different scholars to study and examine the same linguistic element. This has laid the foundation for experimental phonetics. The combination of language corpora with language and culture expands linguistic typology, Sheng said. Corpora help facilitate comparative study between local dialects, and between dialects and the common language. Using corpora, it is easy to discern changes in a language and the rules of language dissemination.

Before linguists had this sort of technology, they could not follow dynamic trends in the languages they studied, Gu Yueguo commented. The modern corpus has greatly extended what linguists can do, Gu said, pointing out in particular that it is now much easier to understand the phonetic rules of different languages. Additionally, multimodal language studies have deepened our understanding of the relationship between behaviors and linguistic activities, he said.

Cao Zhiyun, vice-president of Beijing Language and Culture University, observed that with the arrival of the age of Big Data, linguistics too demands data on a much larger scale. The construction of an audio database of China’s language resources is definitely a step in this direction, Cao said, indicating that he thought the database’s use will exert a historic influence.

Language is part and parcel of ethnicity, Sheng Yulin said. These databases will have great value as a fundamental tool for anthropology, ethnology and cultural studies.

 

The Chinese version appeared in Chinese Social Sciences Today, No. 466, Jun. 24, 2013

Translated by Zhang Mengying