Data accessibility encourages reuse of scientific output
Features of Science Data Bank supported by the Chinese Academy of Sciences Photo: SCIENCE DATA BANK
Open and shared data displays features such as independent identification, attribute descriptions, guardianship mechanisms, and a traceability process. Based on the FAIR Guiding Principles (findability, accessibility, interoperability, and reusability), scholarly data ensures that research results can be verified, disseminated, and reproduced.
Necessary requirements
As for funding sources, public funds support most scientific research currently underway, so the results should reach the public free of charge. In March 2018, in its Scientific Data Management Measures, the General Office of the State Council proposed that “Scientific data formed by government budget funding should follow the principle of openness as the norm and non-openness as the exception. Management departments should organize the compilation of scholarly data. The data catalogues should become available on national data sharing and exchange platforms in a timely manner, giving access to society and academia.”
In 2013, the US Office of Science and Technology Policy (OSTP) instructed each federal agency with annual R&D expenditures of over $100 million to develop a plan to support increased public access to the results of research funded by the Federal Government. This included any results published in peer-reviewed scholarly publications that were based on research directly supported by Federal funds. Open access to publicly funded research outputs supports resource sharing and social supervision, thus helping to curb scientific misconduct.
Both journal submission requirements and researchers’ scholarly needs call for building scientific data repositories as platforms to secure effective management, open sharing, and standardized citation, publication, and dissemination.
China is a latecomer in increasing access to scientific data. Requirements in place from foreign journals and a heavy reliance on foreign data repositories have confined its efforts. In 2019, the country established 20 national-level scientific data centers in the hope of improving sci-tech resource sharing systems and making them available to the public. This move encourages national platforms to collect scientific data in multiple fields, optimizing infrastructure for storing, governing, and safeguarding research results.
Science Data Bank (ScienceDB) is a public general-purpose data repository that aims to provide researchers with data services such as data acquisition, long-term preservation, publishing, sharing, and citation. It is constructed and maintained by the Computer Network Information Center of the Chinese Academy of Sciences (CAS).
Faster research process
Scientific data accessibility allows other researchers to cite or reproduce experiments, which helps to reduce unnecessary repeated operations, shorten research cycles, and speed up the research process for the entire field. Data in information science is particularly well suited to openness and sharing.
Various algorithm competitions provide benchmark data sets, such as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). In 2012, AlexNet won the ILSVRC competition for image classification and object recognition algorithms, which is based on ImageNet. It achieved immediate fame as its error rate was 10.8 percentage points lower than that of the second-place algorithm, sparking a wave of deep learning research based on convolutional neural networks and GPUs. The accessibility and application of benchmark data sets have driven major research progress and breakthroughs in related fields.
Scientific data accessibility can help expand the academic influence of research findings. For example, in 2016, “MODIS Daily Cloud-Free Snow Cover Products Over Tibetan Plateau,” authored by several scientists, including Qiu Yubao, a research fellow at the Aerospace Information Research Institute at CAS, received thousands of visits and was republished on multiple platforms and national scientific data centers. It has remained one of the most-viewed items on ScienceDB. Positive feedback on its scholarly data poured in from many users at home and abroad. Researchers in similar fields have also made further use of the research results.
Accessible and shared scientific data also provides evidence for open and rational academic exchange. For example, in 2019 the journal Sociological Studies published an article titled “Housing Marketization and Housing Inequality: A Study Based on CHIP and CFPS Data.” One of its readers, Mr. Bu (a screen name), conducted a replication study of this article and publicly raised some questions. Subsequently, the author, Wu Kaize, an associate professor from the School of Social and Public Management at East China University of Science and Technology, responded to Mr. Bu’s questions, specifically discussing the results of data processing and model analysis.
This type of quantitative research is reproducible. Professional readers can reproduce the research because the disputed article uses publicly accessible data from the Chinese Household Income Project (CHIP) and the China Family Panel Studies (CFPS). In this way, scholars can debate frankly and further strengthen research credibility.
Formal measures
Global academia is elevating scientific data’s level of openness and sharing. In 2015, the International Council for Science (ICSU), the InterAcademy Partnership (IAP), The World Academy of Sciences (TWAS), and the International Social Science Council (ISSC) issued the Open Data in a Big Data World accord, which reflects the belief that open data enhances the efficiency, productivity, and creativity of public research enterprises. In addition, the organizations agreed that concurrent open publication of the data that underpins scientific papers can provide the basis of scientific “self-correction.”
The Center for Open Science issued the Transparency and Openness Promotion (TOP) guidelines for journal publishing in terms of citation, data, code, research materials, research design, content analysis, research pre-registration, and replication. Publishers such as Elsevier, Springer Nature, Taylor & Francis, and Wiley have also formulated data sharing policies, encouraging authors to cite relevant scientific data, provide data availability statements, and store the data in proper data repositories.
Scientific data management in China is gradually being standardized. Regarding national policy, in March 2018, the Scientific Data Management Measures noted that “the department in charge should actively promote the publication and dissemination of scientific data, and support scientific research personnel to organize and publish scientific data with clear and accurate property rights and high value for sharing.” It required that “scientific data users should conform to the provisions of intellectual property rights, and clearly indicate the scientific data they use and cite when they publish papers and monographs or apply for patents.”
In terms of journals, the country has launched data journals such as China Scientific Data and Journal of Global Change Data & Discovery, and some traditional academic journals have opened special columns for data research.
Constant practices
Scientific data can be opened and shared in three ways. The first is to cooperate with professional scientific data repositories to publish sci-tech papers together with their supporting data. The second is to publish data independently in data repositories, instead of journals. The third is to publish data as papers in journals.
Regarding best practices, the first method is more flexible. For example, authors can submit data sets along with papers, so that the journal panel can review papers and data sets together. Or, authors can submit data sets after their papers are accepted. Journals can review and compile data sets before final paper publication.
Linking scientific data and papers has many advantages for data sharing and reuse, such as independent citation, independent identification, and independent measurement and evaluation. In addition, open and shared data can endure when the responsibility for data entry, storage, and security management falls on professional scientific data repositories.
Today, open and shared scientific data serves as an engine to facilitate high-level collaboration, open access, data sharing, and transparency in scientific research. It is helping people tackle the challenge of reproducibility in scientific research.
China has great potential to promote the openness and sharing of scientific data. Government management bodies and the research community have made sustained efforts and achieved some gains. The developing concept and practice of open and shared data will chart a blueprint for scientific research reproducibility.
Li Zongwen, Wang Pengyao and Jiang Lulu are from the Computer Network Information Center at the Chinese Academy of Sciences.
Edited by MA YUHONG