Developing high-quality structured humanities knowledge bases
FILE PHOTO: The China Biographical Database (CBDB) hosted by Harvard University
Digital and intelligent technologies have introduced new methodologies to traditional humanities disciplines, giving rise to new interdisciplinary fields represented by digital humanities. Breakthroughs have been made in the digitalization, quantification, and visualization of humanities materials.
In the past, humanities scholars often misunderstood knowledge structuring, assuming that digitalization was complete once paper materials had been scanned into electronic documents or artifacts had been scanned and modeled in 3D. However, scanning is merely data collection. Knowledge structuring involves not only collecting data but also converting the data into conceptual nodes and establishing clear, effective relationships between these nodes.
Concepts include people, artifacts, events, time, and places. Relationships encompass interpersonal relationships, object attributes, and people-place relationships. Conceptual nodes and relationships form different structures such as linear structures, hierarchical structures, and networked structures. Efficient retrieval, statistical analysis, and reasoning can only be achieved through highly structured humanities knowledge bases.
Knowledge structuring
The China Biographical Database (CBDB), hosted by Harvard University, is a good example of humanities knowledge structuring. Covering over 530,000 biographies from Chinese history, this large-scale knowledge base provides information about more than 640,000 individuals, such as dates of birth and death, kinship, social relations, postings to office, and places associated with their lives. Traditionally, experts would spend tremendous effort organizing materials and refining the wording in order to compose a brief biography for each historical figure. Such prose biographies are well suited to qualitative research.
By contrast, CBDB is a structured database that establishes relationships among various attributes of individuals by means of triplets such as <Person, Birth year, 3rd year of Hongwu>, <3rd year of Hongwu, Conversion to AD, 1370>, and <Person 1, Father-Son Relationship, Person 2>. This enables various forms of quantitative analysis, including gender ratio, average lifespan, average age at first marriage, the size of an individual’s social network, and the places that an individual visited. The benefits of structuring humanities knowledge are two-fold: it not only allows for statistical analyses after completion, but also reveals problems such as errors, inconsistencies, ambiguities, and missing data during the process.
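The triplet idea can be illustrated with a minimal Python sketch. It is not CBDB's actual schema or API: the relation names (birth_year, death_year, father_of), the helper functions, and all persons and dates are hypothetical and chosen only for illustration. The one historical fact assumed is that the first year of Hongwu corresponds to 1368, which is consistent with the conversion of the 3rd year of Hongwu to 1370 given above.

```python
from statistics import mean

# First AD year of each reign period; the 1st year of Hongwu is 1368.
REIGN_START = {"Hongwu": 1368}


def to_ad(reign, year_of_reign):
    """Convert a reign year to an AD year (reign year 1 is the start year)."""
    return REIGN_START[reign] + year_of_reign - 1


# Hypothetical <subject, relation, object> triplets; names and dates are invented.
TRIPLES = [
    ("Person A", "birth_year", to_ad("Hongwu", 3)),
    ("Person A", "death_year", 1430),
    ("Person B", "birth_year", 1395),
    ("Person B", "death_year", 1445),
    ("Person A", "father_of", "Person B"),
    ("Person C", "birth_year", 1420),  # no death year recorded
]


def value(triples, subject, relation):
    """Return the first object stored for (subject, relation), or None."""
    for s, r, o in triples:
        if s == subject and r == relation:
            return o
    return None


def lifespans(triples):
    """Map each person with both a birth and a death year to a lifespan."""
    people = sorted({s for s, _, _ in triples})
    spans = {}
    for person in people:
        born = value(triples, person, "birth_year")
        died = value(triples, person, "death_year")
        if born is not None and died is not None:
            spans[person] = died - born
    return spans


def missing(triples, relation):
    """List the people for whom the given relation is not recorded."""
    people = sorted({s for s, _, _ in triples})
    return [p for p in people if value(triples, p, relation) is None]


spans = lifespans(TRIPLES)
print("Lifespans:", spans)
print("Average lifespan:", mean(spans.values()))
print("Missing death year:", missing(TRIPLES, "death_year"))
```

The same pass that computes the average lifespan also surfaces the person with no recorded death year, which mirrors the point that structuring exposes gaps in the data as a side effect.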
Reasoning
The original CBDB data do not include the concept of “family.” They do contain over 400 kinship terms, such as “father,” “eldest son,” and “youngest son,” but these terms cannot be used directly to identify families. However, kinship triplets such as <Person 1, Kinship, Person 2> and <Person, Sex, Male> can be created using three types of relationships: father, mother, and spouse. On this basis, the hierarchical structure of patrilineal families can be delineated with the help of tree-structure algorithms. Once the concept of “family” is established, the size of ancient extended families, their generational continuity, and intermarriage patterns among them can be analyzed to explore how these families opposed and cooperated with one another. Likewise, geographic information systems can be used to map ancient place names onto modern geographical coordinates to show whether ancient cities were mostly near rivers or near valleys, and where ancient relay stations were typically located.
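One way to derive patrilineal families from father-son triplets is sketched below in Python. This is an illustrative assumption about how a tree-structure algorithm might be applied, not the method used by the CBDB team: each person is assigned to the earliest known male-line ancestor, and everyone sharing that ancestor forms one family. The names and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical (father, child) pairs taken from father-son triplets.
FATHER_OF = [
    ("Zhang 1", "Zhang 2"),
    ("Zhang 2", "Zhang 3"),
    ("Zhang 2", "Zhang 4"),
    ("Li 1", "Li 2"),
]


def build_families(father_of):
    """Group people by their earliest known patrilineal ancestor.

    Returns {ancestor: {"members": set of names, "generations": int}}.
    """
    father = {child: dad for dad, child in father_of}
    people = {name for pair in father_of for name in pair}

    def ancestor_and_depth(person):
        # Walk up the father links; 'seen' guards against cyclic (erroneous) data.
        depth, seen = 0, set()
        while person in father and person not in seen:
            seen.add(person)
            person = father[person]
            depth += 1
        return person, depth

    families = defaultdict(lambda: {"members": set(), "generations": 0})
    for person in people:
        root, depth = ancestor_and_depth(person)
        family = families[root]
        family["members"].add(person)
        # A person 'depth' steps below the ancestor belongs to generation depth + 1.
        family["generations"] = max(family["generations"], depth + 1)
    return dict(families)


for root, family in sorted(build_families(FATHER_OF).items()):
    print(root, len(family["members"]), "members,",
          family["generations"], "generations")
```

Family size and generation count fall out of the same traversal, and these are the quantities needed to study extended-family size and generational continuity. Intermarriage analysis would additionally require spouse triplets that link members of different trees.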
Challenges
At present, defining concepts and relationships is the greatest challenge in structuring humanities knowledge. For instance, as the official system continually evolved and varied across Chinese dynasties, appropriate conceptual systems and relationship triplets need to be created to represent the connections between different official positions. “Events” is currently the most difficult concept to process, because a major event may consist of several smaller events, which can be further broken down into even smaller ones, and elements such as involved parties and time also vary from event to event.
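The nesting that makes events hard to model can be made concrete with a small recursive data structure. The Python sketch below is only one possible representation under simplifying assumptions: the Event class, its fields, and the sample events are all hypothetical, and a real model would also need to handle uncertain dates, places, and participant roles.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Event:
    """A hypothetical event node; any event may contain smaller events."""
    name: str
    year: Optional[int] = None  # not every event has a known date
    participants: Set[str] = field(default_factory=set)
    sub_events: List["Event"] = field(default_factory=list)


def all_participants(event):
    """Collect participants from an event and all of its nested sub-events."""
    people = set(event.participants)
    for sub in event.sub_events:
        people |= all_participants(sub)
    return people


def nesting_depth(event):
    """Number of levels in the event hierarchy (1 for an event with no parts)."""
    return 1 + max((nesting_depth(sub) for sub in event.sub_events), default=0)


# Invented example: a campaign made up of a siege, which contains a negotiation.
negotiation = Event("Negotiation", year=1402, participants={"Envoy X"})
siege = Event("Siege", year=1402, participants={"General Y"},
              sub_events=[negotiation])
campaign = Event("Campaign", participants={"General Y", "Minister Z"},
                 sub_events=[siege])

print(sorted(all_participants(campaign)))
print(nesting_depth(campaign))
```

Because each level carries its own participants and date, queries over a major event have to traverse the whole hierarchy, which is part of why events are harder to structure than flat attributes such as birth year.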
These issues should be addressed through collaboration between humanities scholars and computer scientists. On the one hand, computer scientists often lack the deep humanities knowledge necessary for making qualitative judgments. On the other hand, high-quality multilingual data on ancient texts and knowledge bases covering ancient periods remain scarce.
In the future, the development of structured humanities knowledge bases can be enhanced by building knowledge bases each covering a specific historical period and knowledge area, which can be further integrated into international, comprehensive, multilingual humanities knowledge platforms. This may lead to more humanities research that combines macroscopic and microscopic perspectives, as well as qualitative and quantitative approaches. Moreover, high-quality structured humanities knowledge bases have a wide range of applications in fields such as humanities education, intercultural communication, and science communication.
Li Bin is a professor at the School of Chinese Language and Literature at Nanjing Normal University.
Edited by WANG YOURAN