Motivation
In the digital age, data is generated and new scientific breakthroughs are made at a rapid pace. With millions of scientific articles published every year, keeping up with the latest developments and discoveries in various fields is increasingly hard for both researchers and the general public. Finding appropriate evidence for scientific claims and research hypotheses in academic databases costs researchers a lot of time and resources. Additionally, companies try to stay competitive by incorporating the latest scientific advances into their industrial processes. Automating the task of knowledge extraction and evidence mining could facilitate the work of both researchers and companies.
The Internet has benefited the world by making knowledge easily accessible, but this has inevitably introduced new risks. It has become difficult to distinguish reliable sources from dubious content. Many scientific claims found in articles, social media posts, or news reports are not trustworthy or backed by evidence. Moreover, humans are not the only source of unreliable content: modern generative language models tend to produce convincing-sounding but factually incorrect text. All of this can easily lead to the creation and proliferation of misinformation, which has negative socioeconomic consequences.
Project Description
The research project VeriSci focuses on developing natural language processing (NLP) solutions for the automated evaluation and assessment of scientific claims. The complete process of fact verification involves detecting check-worthy claims, finding relevant documents in a corpus of articles, extracting passages containing appropriate evidence, and finally deciding on the veracity of the claim by inferring whether there is logical entailment between the claim and the evidence found. To achieve these goals, models based on different architectures will be investigated and evaluated. To increase trust in the developed machine-learning models, mechanisms for making their decision process interpretable and explainable to humans will be explored. In addition to the theoretical contributions of the project, a prototype system for real-time scientific claim verification will be constructed.
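To make the stages of this pipeline more concrete, the following is a minimal sketch of document retrieval, evidence selection, and verdict prediction, with claim detection omitted. It is not the project's implementation: all function names, the label set (SUPPORTED, REFUTED, NOT_ENOUGH_INFO), the threshold, and the toy corpus are assumptions made for illustration, and simple word overlap plus a negation check stand in for the neural retrieval and entailment models the project would investigate.

```python
import re

# Hypothetical label set, chosen for this sketch only.
SUPPORTED, REFUTED, NOT_ENOUGH_INFO = "SUPPORTED", "REFUTED", "NOT_ENOUGH_INFO"
NEGATIONS = {"not", "no", "never"}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(a, b):
    """Jaccard similarity between the token sets of two texts."""
    ta, tb = tokenize(a), tokenize(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve_documents(claim, corpus, k=2):
    """Stage 1: rank documents by lexical overlap with the claim."""
    return sorted(corpus, key=lambda doc: overlap(claim, doc), reverse=True)[:k]


def select_evidence(claim, documents, k=1):
    """Stage 2: split documents into sentences and keep the best matches."""
    sentences = [
        s.strip()
        for doc in documents
        for s in re.split(r"(?<=[.!?])\s+", doc)
        if s.strip()
    ]
    return sorted(sentences, key=lambda s: overlap(claim, s), reverse=True)[:k]


def predict_verdict(claim, evidence, threshold=0.3):
    """Stage 3: toy stand-in for an entailment model.

    Evidence that barely overlaps with the claim counts as insufficient;
    otherwise a mismatch in negation words is treated as a contradiction.
    """
    if not evidence:
        return NOT_ENOUGH_INFO
    best = max(evidence, key=lambda s: overlap(claim, s))
    if overlap(claim, best) < threshold:
        return NOT_ENOUGH_INFO
    claim_negated = bool(tokenize(claim) & NEGATIONS)
    evidence_negated = bool(tokenize(best) & NEGATIONS)
    return SUPPORTED if claim_negated == evidence_negated else REFUTED


def verify(claim, corpus):
    """Run the three stages and return the verdict with its evidence."""
    documents = retrieve_documents(claim, corpus)
    evidence = select_evidence(claim, documents)
    return predict_verdict(claim, evidence), evidence


# Toy corpus, invented for this example.
CORPUS = [
    "Vitamin C does not prevent the common cold. It may shorten colds slightly.",
    "Aspirin reduces the risk of heart attack in adults.",
]

if __name__ == "__main__":
    for claim in [
        "Aspirin reduces the risk of heart attack.",
        "Vitamin C prevents the common cold.",
        "Graphene is a superconductor at room temperature.",
    ]:
        print(claim, "->", verify(claim, CORPUS)[0])
```

Returning the selected evidence together with the verdict keeps each decision inspectable, which is in line with the interpretability goal stated above. In a realistic system, each stage would be replaced by a trained model while the overall interface could stay the same.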
Owing to the complexity of the language used in scientific publications, language models trained on general-purpose data can struggle with scientific text. Therefore, methods for domain adaptation to scientific text will be explored. Due to the highly hierarchical nature of scientific knowledge, knowledge graphs and ontologies are commonly used to represent scientific concepts and the relations between them in a structured format. The project will examine how useful these structured-knowledge resources are for improving the performance of language models. It will also look at related natural language understanding (NLU) tasks in the scientific domain, such as question answering, argumentation mining, and natural language inference, and will tackle the problem of factual correctness in automatically generated text summaries.
Research Objectives
Partners and Sponsors
The project is carried out in collaboration with Digital Science, which is part of the Holtzbrinck Publishing Group. The project is part of the Software Campus Framework and is sponsored by the Federal Ministry of Education and Research (BMBF).