Abstractive Text Summarization for Domain-Specific Documents (ATESD)

Project Overview

With the ever-increasing amount of textual data being created, stored, and digitized, companies and researchers have large corpora at their disposal that could be processed into useful information. If this information were analyzed and summarized, it could open doors to newer and faster ways of information management and analysis. Today, such perusal and encapsulation is usually done by a domain expert who understands the documents and extracts insights from them. This is inefficient, not only because of the cost of hiring a domain expert but also because of the manual repetition involved. Abstractive text summarization using Natural Language Processing (NLP) techniques is a powerful tool that can aid this task.


In many sectors, stakeholders rely on such documents to make critical decisions or to get a global picture of the situation. This may mean summarizing financial reports for a board member, performing risk assessment, or condensing a medical article to understand the key findings in a research field. Text summarization is an NLP technique that uses state-of-the-art algorithms to understand a text and then generates, without any human input, summaries that express its salient points. It allows researchers to grasp the main idea of a paper and leaders to gain insights into a field without spending hours searching through unstructured documents, helps stakeholders make better decisions from overviews of financial or risk reports, and opens new ways for people without domain knowledge to access and understand the main ideas of a text and perhaps build on them. In a nutshell, text summarization can simulate the work of an intelligent analyst, reducing search time and surfacing relevant information much faster.

Compared to the summarization of general documents, summarizing domain-specific documents is more difficult and poses three main challenges. First, such documents involve keywords and concepts that are hard to grasp for anyone without explicit domain knowledge; since these keywords and concepts are not part of a model's original training data, the model performs poorly when summarizing domain-specific documents. This can be addressed either by employing transfer learning techniques or by training the model from scratch on a domain-specific corpus. Second, state-of-the-art summarization models limit the size of the input text because self-attention scales quadratically with input length. This affects news articles, scientific articles, financial documents, and essentially any domain that deals with long documents, so a model architecture that can process long documents efficiently is desired; a newer generation of models known as Efficient Transformers works toward exactly that and can be used for summarization. Lastly, language models often hallucinate, generating text that may not be factually true. In this project, measures will be taken to fact-check the generated summaries and thereby minimize model hallucination.
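As a concrete illustration of the long-input challenge, the sketch below runs a publicly available Efficient Transformer, the Longformer Encoder-Decoder (LED) from the Hugging Face transformers library, on a document far longer than the roughly 512-1024 token budget of standard models. The checkpoint, input file, and generation settings are illustrative assumptions, not the model chosen for this project.

    # Sketch: summarizing a long document with an Efficient Transformer (LED).
    # Checkpoint, input file, and generation settings are illustrative assumptions.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    checkpoint = "allenai/led-base-16384"  # windowed attention, up to 16k input tokens
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

    # Hypothetical domain document, e.g. a financial report.
    with open("financial_report.txt") as f:
        document = f.read()

    # A standard Transformer would truncate at ~512-1024 tokens; LED's sparse
    # attention scales linearly, so much longer inputs fit in one pass.
    inputs = tokenizer(document, return_tensors="pt",
                       truncation=True, max_length=16384)
    summary_ids = model.generate(**inputs, max_length=256, num_beams=4)
    print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))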


In a nutshell, we will adapt a general-purpose transformer-based language model to a specific domain and optimize it for the text summarization task, while addressing the research gaps related to model hallucination and input size limitations.
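A minimal sketch of this domain-adaptation step is shown below: a general-purpose checkpoint is fine-tuned on pairs of domain documents and reference summaries with the Hugging Face Trainer API. The checkpoint, corpus file, field names, and hyperparameters are assumptions made for illustration only.

    # Sketch: domain adaptation by fine-tuning a general-purpose summarizer
    # on (document, summary) pairs. All names and hyperparameters are
    # illustrative assumptions.
    from datasets import load_dataset
    from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                              DataCollatorForSeq2Seq, Seq2SeqTrainer,
                              Seq2SeqTrainingArguments)

    checkpoint = "facebook/bart-base"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

    # Hypothetical JSON-lines corpus with "document" and "summary" fields.
    train_data = load_dataset("json", data_files="domain_corpus.jsonl")["train"]

    def preprocess(batch):
        model_inputs = tokenizer(batch["document"], truncation=True, max_length=1024)
        labels = tokenizer(text_target=batch["summary"], truncation=True, max_length=256)
        model_inputs["labels"] = labels["input_ids"]
        return model_inputs

    train_data = train_data.map(preprocess, batched=True,
                                remove_columns=train_data.column_names)

    trainer = Seq2SeqTrainer(
        model=model,
        args=Seq2SeqTrainingArguments(output_dir="atesd-summarizer",
                                      per_device_train_batch_size=4,
                                      num_train_epochs=3),
        train_dataset=train_data,
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()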

Research Objective

Despite the strong performance of general language models on benchmark datasets for text summarization, generating meaningful summaries for domain-specific documents remains challenging because of the expert knowledge required in the field and issues such as model hallucination and input size limitations. The goal of this project is to improve on current text summarization models in ways that close the research gaps identified above. In general, the project focuses on generating concise summaries with NLP-based techniques while addressing the following research questions:

  • Would domain adaptation of general-purpose language models allow them to understand the underlying concepts of a new domain? 
  • How can existing language models be adapted to ensure factual correctness of the generated text?
  • How can the input size limitation of traditional language models be overcome without discarding meaningful data?

Research Questions

  • Can Efficient Transformer models encode text as effectively as the original Transformer models?
  • Are Efficient Transformer models able to remove the limitation on input size while ensuring linear time complexity?
  • Which improvements over existing model architectures would ensure factual correctness of the generated summaries? (A minimal fact-checking baseline is sketched below.)
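One simple way to operationalize such a fact-check is to treat the source document as the premise and each summary sentence as a hypothesis for an off-the-shelf natural language inference (NLI) model: sentences the source does not entail are candidate hallucinations. The sketch below uses the public roberta-large-mnli checkpoint and its label order; it is an illustrative baseline, not the method the project will develop.

    # Sketch: NLI-based factual-consistency check for generated summaries.
    # Checkpoint and label order follow roberta-large-mnli; this is an
    # illustrative baseline, not the project's fact-checking method.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    checkpoint = "roberta-large-mnli"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
    model.eval()

    def is_supported(source: str, summary_sentence: str) -> bool:
        """True if the NLI model judges that the source entails the sentence."""
        inputs = tokenizer(source, summary_sentence, return_tensors="pt",
                           truncation=True, max_length=512)
        with torch.no_grad():
            logits = model(**inputs).logits
        # Label order for this checkpoint: 0=contradiction, 1=neutral, 2=entailment.
        return logits.argmax(dim=-1).item() == 2

    # Summary sentences that are not entailed by the source are flagged
    # as candidate hallucinations for review or re-generation.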

Research Partner

This project is part of the Software Campus framework and fosters a research partnership with the Holtzbrinck Group.