AI applications in paleography

By MO BOFENG and ZHANG CHONGSHENG / 09-05-2024 / Chinese Social Sciences Today

AI can be of great assistance with the deciphering and understanding of ancient Chinese writings. Photo: TUCHONG


The application of AI technology in paleography has yielded many breakthroughs globally. In March 2022, DeepMind published Ithaca, an AI model that helped historians raise their accuracy in restoring damaged ancient Greek inscriptions from 25% to 72%. In February 2024, a team of researchers competing in the Vesuvius Challenge used AI to read previously unknown text from charred ancient papyrus scrolls. These achievements have not only attracted attention within the field of paleography, but also deepened the public’s understanding of “AI for Humanity.”


As AI plays a growing role in paleography, this interdisciplinary research is gradually becoming an independent field, and terms such as “Digital Paleography” and “Computerized Paleography” are emerging worldwide. Specialized subfields have developed within these frameworks, such as “Digital Epigraphy” and “Machine Learning for Ancient Languages.” In China, “Computational Oracle Bone Studies,” a field with Chinese characteristics, is on the rise.


AI broadening paleography

Each of the world’s ancient writing systems poses different research challenges, so different AI techniques are needed. The authors identify five primary areas in which AI is applied in paleography:


Intelligent data ingestion: Most ancient manuscripts have become extremely fragile over time, so digitization is the best option for both preserving and studying them. Ancient written materials are often in worse condition than ordinary ancient books, with more serious degradation, staining, and color distortion. AI-based image processing can separate the text from the background and sharpen it. The Dead Sea Scrolls, for instance, were once very hard to read because the original handwriting blended into the substrate on which it was written; current AI systems can separate the writing from the substrate pixel by pixel, revealing the text with remarkable clarity. Intelligent ingestion systems can also analyze page layout automatically and divide an image into text, graphics, tables, and notes at varying levels of granularity. One example is the 2017 competition held at the 14th IAPR International Conference on Document Analysis and Recognition, which focused on layout analysis of medieval Latin and Italian manuscripts: AI was used to automatically differentiate between text, annotations, decorative images, and background. Across a wide range of applications, intelligent ingestion and processing systems significantly improve data quality, which facilitates scholarly research and raises the accuracy of downstream tasks such as automatic text recognition.
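
To make the idea of separating text from background concrete, here is a minimal sketch of Otsu thresholding, a classical, non-AI baseline for the same task. It is not the method used by the systems mentioned above, which rely on learned models. The function names and the tiny sample image are hypothetical, and the sketch assumes that ink is darker than the substrate.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that best splits pixels into two classes."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_level, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += hist[level]
        sum_dark += level * hist[level]
        if weight_dark == 0:
            continue  # no pixels in the dark class yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance (up to a constant factor).
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


def binarize(image):
    """Map a 2-D list of gray levels to 1 (ink) and 0 (background).

    Assumption: ink is darker than the writing surface.
    """
    threshold = otsu_threshold([p for row in image for p in row])
    return [[1 if p <= threshold else 0 for p in row] for row in image]


# Hypothetical 3x4 scan: dark strokes (30-40) on a light substrate (200-220).
sample = [
    [210, 205, 40, 215],
    [208, 35, 30, 212],
    [220, 200, 38, 207],
]
ink_mask = binarize(sample)
```

A single global threshold fails on exactly the stained, faded, or unevenly lit materials described above, which is one reason learned pixel-level models are used instead.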


AI-assisted restoration: Many ancient manuscripts are severely damaged and require extensive restoration. Manual restoration is often slow, whereas AI has proven both powerful and efficient. AI-driven restoration has two primary aspects. The first is the reassembly of fragments. AI can mimic traditional restoration methods by using clues such as text content and morphological features to piece fragments together. It can also detect information that may be imperceptible to the human eye. For example, when restoring fragments from the Cairo Genizah, researchers used AI to extract the line height and line spacing of the text, which provided new clues for matching fragments. The second aspect is the completion of incomplete materials. By training on large corpora, AI can not only reassemble fragmented texts like a jigsaw puzzle, but also predict and fill in missing sections of text. For example, DeepMind restored ancient Greek texts by generating multiple predictions and hypotheses about the missing text for experts to choose from. Given the vast number of ancient text fragments, intelligent restoration offers clear advantages. Compared with physical restoration, it is also more conducive to the protection of cultural relics, which makes it an inevitable trend in the field.
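
The idea of generating several ranked hypotheses for a gap can be illustrated with a toy character-bigram model. This is only a sketch under stated assumptions: the systems described above use deep neural networks trained on large corpora, whereas the corpus, the function names, and the "?" gap marker below are all hypothetical.

```python
import math
from collections import Counter


def train_bigrams(corpus):
    """Count character bigrams in a list of strings ('^' and '$' mark the ends)."""
    pair_counts, char_counts = Counter(), Counter()
    for line in corpus:
        padded = "^" + line + "$"
        for a, b in zip(padded, padded[1:]):
            pair_counts[(a, b)] += 1
            char_counts[a] += 1
    vocab_size = len(set("".join(corpus))) + 2  # +2 for the end markers
    return pair_counts, char_counts, vocab_size


def log_score(text, model):
    """Log-probability of text under the bigram model with add-one smoothing."""
    pair_counts, char_counts, vocab_size = model
    padded = "^" + text + "$"
    return sum(
        math.log((pair_counts[(a, b)] + 1) / (char_counts[a] + vocab_size))
        for a, b in zip(padded, padded[1:])
    )


def rank_restorations(damaged, candidates, model, gap="?"):
    """Return the candidate fillers for the gap, most plausible first."""
    scored = [(log_score(damaged.replace(gap, c), model), c) for c in candidates]
    return [c for _, c in sorted(scored, reverse=True)]


# Toy corpus standing in for a large body of intact inscriptions.
model = train_bigrams(["the king gave", "the king spoke", "the kind host"])
ranking = rank_restorations("the ki?g", ["d", "n", "x"], model)
```

Returning the whole ranking rather than a single answer mirrors the workflow described above, in which the expert makes the final choice among the model's suggestions.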


Intelligent text classification: Ancient texts need to be classified in order to narrow down the objects of research. The most widely used method is handwriting-style classification, which distinguishes between scribes by identifying different styles of handwriting. AI application in this area is already technically mature. For example, identification at two scales was applied to the Biblia de Avila, held by the National Library of Spain. One method focused on the characteristics of individual characters, while the other analyzed the features of an entire page of text. Combining the two greatly improved identification accuracy. Beyond handwriting style, the ancient texts themselves can also be classified. A wealth of additional information can be obtained through technical means such as carbon-14 dating, DNA sequencing, and hyperspectral imaging. Combined with AI algorithms, these techniques can be used to infer a text’s age, material, and place of origin. In the case of the Dead Sea Scrolls, carbon-14 dating together with AI algorithms has yielded more accurate and reliable age estimates. AI-driven classification and attribute inference can uncover insights beyond the written content, offering new research perspectives and adding depth to the field of paleography.


AI-assisted decipherment: Automatic recognition of modern scripts is technically mature, and extending these techniques to ancient writing systems has long intrigued computer scientists. Specialized models have been developed, with notable success, for various ancient writing systems, including Egyptian hieroglyphs, the Mayan system of writing, the writing systems of ancient India, and ancient cursive scripts. The rise of deep learning in recent years has not only greatly improved the accuracy of deciphering ancient writing systems, but also extended the scope to writing systems the models have never seen, a crucial breakthrough for paleography. Such models are first pre-trained on datasets of synthetic (non-existent) writing systems and then refined through fine-tuning and metric learning on datasets of real ancient texts, after which they can decipher writing systems that were absent from their training data. These decipherment models are increasingly applied to real-world tasks. For example, they have been used to decipher and transcribe ancient archival documents in the Netherlands and elsewhere, greatly improving sorting efficiency. Automatic decipherment has lowered the barrier to entry, promoted the dissemination of knowledge and academic exchange, and injected new vitality into humanities research.
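
One reason metric learning helps with unseen writing systems is that recognition becomes a comparison between embeddings rather than a choice among fixed output classes. The sketch below shows only that final comparison step. It assumes a trained network has already mapped each glyph image to a vector; the three-dimensional vectors, labels, and function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_label(query, references):
    """Return the label whose reference embedding is most similar to the query."""
    return max(references, key=lambda label: cosine(query, references[label]))


# One reference embedding per known sign (made-up numbers).
references = {
    "sign_A": [0.9, 0.1, 0.0],
    "sign_B": [0.1, 0.8, 0.3],
}
first_guess = nearest_label([0.8, 0.2, 0.1], references)

# A sign from a script absent from training can be added without retraining:
# a single reference embedding is enough to make it recognizable.
references["sign_C"] = [0.0, 0.2, 0.9]
second_guess = nearest_label([0.1, 0.1, 0.8], references)
```

In this toy setup the first query lands nearest to `sign_A` and the second nearest to the newly added `sign_C`.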


AI interpretation and analysis: Understanding language has always been challenging for AI, and understanding the ancient languages recorded in ancient writing systems is harder still. Language understanding has advanced significantly in recent years, driven by large language models such as ChatGPT, but models for ancient languages remain constrained by the limited corpus data available. Previous studies have achieved part-of-speech analysis of words within texts and the identification of entities such as the names of people, places, and institutions in sentences or corpora. For example, progress has been made in part-of-speech tagging for ancient Greek, named entity recognition for ancient Korean, and zero-shot sentiment analysis for Sanskrit. Some models have also achieved initial success in automatic translation between languages, such as between ancient Greek and Latin. Language is central to a text. If AI continues to progress in this area, it will give paleography a strong boost.


AI applications in ancient Chinese writing systems

AI has been applied to the decipherment and interpretation of ancient Chinese writing systems. Although decipherment technology is mature in general, ancient Chinese writing systems pose unique challenges: the diverse shapes of ancient Chinese characters, insufficient training data, and the uneven distribution of variant characters all add significant complexity to the process. Chinese researchers are employing specialized methods such as data augmentation and contrastive learning to improve decipherment accuracy. Professor Li Chuntao of Jilin University and his team have undertaken large-scale tasks with a decipherment accuracy rate as high as 80.24%. There are also ongoing research efforts with “Chinese characteristics” that focus on deciphering and interpreting unknown ancient Chinese characters. AI is being used to explore the evolution of ancient Chinese characters and to identify new character shapes by simulating the methods of traditional philologists, such as comparing character shapes and analyzing radicals. Although these methods alone cannot fully resolve the complexities of character decipherment and interpretation, they demonstrate AI’s great potential.


Potential of AI in compilation of ancient Chinese texts: In recent years, AI has been effectively applied in areas including oracle bone rejoining, the collation of duplicated oracle bone rubbings, and the dating of bronzeware. Oracle bone rejoining is the work of reassembling fragmented oracle bones. Traditionally, this process relies on textual clues; AI can draw on additional features to overcome that limitation and support experts in their efforts. Collating duplicated oracle bone rubbings means linking different rubbings of the same oracle bone made at different times. AI can accurately compare the details of these rubbings, even when they differ significantly in completeness and clarity. This capability has led to the identification of many hard-to-spot duplicates, providing considerable value to the field. For the dating of bronzeware, deep learning is used to build expert knowledge into intelligent models. A piece can then be dated simply by taking photos and uploading them to the model, which has significantly lowered the barrier to entry.


Potential of AI in deciphering and understanding ancient Chinese: The development of language models for ancient Chinese has relied primarily on handed-down documents. Because this corpus is relatively large, language models have achieved notable success in tasks such as judou (a traditional method of punctuation used in premodern Chinese texts), named entity recognition, and the translation of ancient literary language into modern colloquial language. Refined further, these models can help with certain core tasks in the study of ancient Chinese characters. For example, by simulating the method of inferring the meaning of a character or phrase from its context, they can predict the meaning of an unknown character. In recent years, when studying the Chu bamboo slips from the Warring States Period housed in the Shanghai Museum, researchers tested the models’ ability by obscuring certain words. The models achieved a prediction accuracy of 59% for the first 300 obscured words. Predictions on larger language units can serve other specific tasks. For example, when determining the binding order of bamboo slips, the content on one slip can be used to predict the text on another.
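
The obscured-word test described above is commonly scored as top-k accuracy: a prediction counts as correct if the hidden word appears among the model's k highest-ranked guesses. The sketch below shows that metric in general terms only. It is not a reconstruction of the researchers' protocol, and the function name and placeholder tokens are hypothetical.

```python
def top_k_accuracy(ranked_predictions, answers, k):
    """Fraction of hidden words found among the model's first k guesses.

    ranked_predictions[i] is the model's guess list for the i-th hidden word,
    best guess first; answers[i] is the word that was actually hidden.
    """
    if not answers or len(ranked_predictions) != len(answers):
        raise ValueError("need one non-empty guess list per hidden word")
    hits = sum(
        1
        for ranked, answer in zip(ranked_predictions, answers)
        if answer in ranked[:k]
    )
    return hits / len(answers)


# Placeholder tokens stand in for the model's ranked character guesses.
guesses = [
    ["a", "b", "c"],
    ["b", "a", "c"],
    ["c", "b", "a"],
    ["a", "c", "b"],
]
hidden = ["a", "a", "a", "b"]

top1 = top_k_accuracy(guesses, hidden, k=1)
top2 = top_k_accuracy(guesses, hidden, k=2)
top3 = top_k_accuracy(guesses, hidden, k=3)
```

Accuracy can only stay level or rise as k grows, so a reported figure is only meaningful when the value of k is known.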


AI continues to achieve new breakthroughs, unveiling promising future trends. One notable trend is the development of multimodal models. The recent release of GPT-4o exemplifies this trend by allowing simultaneous input and output of both texts and images. This integrated approach is particularly beneficial for paleography. Another emerging trend is AI for Research. Given the complexities and unknowns inherent in paleography, AI’s role in this field is set to be transformative. The combination of advanced AI technologies with the study of ancient writing systems holds great potential for significant discoveries and innovations.


Mo Bofeng is a professor from the Center for Oracle Bone Studies at Capital Normal University. Zhang Chongsheng is a professor from the School of Computer and Information Engineering at Henan University.


Edited by REN GUANHONG