
Bachelor's Thesis Clemens Magg


Studying the Effectiveness of Longer Context Windows in LLMs for Summarization Tasks

 

Introduction & Motivation

 

Large Language Models (LLMs) have demonstrated remarkable capabilities across numerous natural language processing tasks, including text summarization. In a highly digitized world with ever easier access to vast amounts of information, processing large volumes of text has become increasingly critical. However, without adequate text preparation, identifying the key aspects of long texts is challenging and time-consuming for humans. LLMs can significantly support human text understanding here by automating text summarization. Models from companies such as OpenAI, Google, and Meta have proven to be very effective information retrievers. We will focus on how the open-source model Llama 3.1 performs in long-text summarization.

Model benchmarks are a valuable tool for evaluating the performance of LLMs and for checking that a model represents the information in its input text accurately and consistently. Many such benchmarks apply metrics like ROUGE, F1, or BLEU scores, which determine how well models imitate human-written summaries by testing lexical alignment with a reference. These metrics, however, reveal little about how well LLMs actually utilize the information distributed across their input. This thesis explores alternative approaches to evaluating long-context-window summarization. We will examine in detail how the model uses the information inside its context window: which information it draws on and which it potentially neglects. Our approach focuses on three context window extension techniques that help LLMs process more input data: ALiBi, YaRN, and LongRoPE. We will combine automated evaluation metrics with human evaluation to achieve a more nuanced scoring. Ultimately, our evaluation pipeline will test whether the model, in combination with these techniques, can handle large amounts of data, and will examine whether more information leads to better summaries.
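As a starting point for the automated part of our evaluation pipeline, the sketch below shows how reference-based lexical scores such as ROUGE could be computed with the open-source rouge-score Python package. The example texts and variable names are illustrative only and not taken from the thesis setup.

    # pip install rouge-score
    from rouge_score import rouge_scorer

    # ROUGE-1/2 count unigram/bigram overlap; ROUGE-L uses the longest common
    # subsequence. use_stemmer reduces words to their stems before matching.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

    reference = "The council approved the new budget after a lengthy debate."
    generated = "After a long debate, the council passed the new budget."

    # score(target, prediction) returns precision, recall, and F1 per variant.
    for name, result in scorer.score(reference, generated).items():
        print(f"{name}: P={result.precision:.2f} R={result.recall:.2f} F1={result.fmeasure:.2f}")

Precisely because such scores capture only surface overlap with a single reference, we will complement them with an analysis of which parts of the input context a summary actually covers.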

 

Research Questions

  • RQ1: What are the most effective techniques for extending the context window of LLMs? (See the sketch after this list for one such technique.)
  • RQ2: How can we adequately test the quality of text summarization by LLMs? Does the quality of the generated summary improve if more of the article's content is passed to the model?
  • RQ3: How do LLMs use the information contained in their context? Do LLMs benefit from a long context window for text summarization? Is the model able to pay attention to all parts of the document, or is its attention clustered on certain parts?
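To make RQ1 more concrete, the sketch below implements the bias matrix that ALiBi adds to the attention logits: each head penalizes query-key pairs linearly in their distance, which lets a model generalize to sequences longer than those seen during training. The function name and tensor shapes are our own illustrative choices, not code from any of the cited papers.

    import torch

    def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
        # Each head gets a slope from a geometric sequence: 2^(-8/n), 2^(-16/n), ...
        slopes = torch.tensor([2.0 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)])
        # distance[i, j] = j - i: zero on the diagonal, negative for past keys.
        positions = torch.arange(seq_len)
        distance = positions[None, :] - positions[:, None]   # (seq_len, seq_len)
        # One bias matrix per head, added to the attention logits before the
        # softmax; under a causal mask only the j <= i entries are ever used,
        # so more distant keys receive an increasingly negative bias.
        return slopes[:, None, None] * distance[None, :, :]  # (n_heads, seq_len, seq_len)

    bias = alibi_bias(n_heads=8, seq_len=16)
    print(bias.shape)  # torch.Size([8, 16, 16])

YaRN and LongRoPE take a different route: instead of adding biases, they rescale the rotary position embedding (RoPE) frequencies used by models such as Llama 3.1, which is why all three techniques are natural candidates for our comparison.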

 

References

  1. Ding, Y., Zhang, L. L., Zhang, C., Xu, Y., Shang, N., Xu, J., ... & Yang, M. (2024). LongRoPE: Extending LLM context window beyond 2 million tokens. arXiv preprint arXiv:2402.13753.
  2. Pawar, S., Tonmoy, S. M., Zaman, S. M., Jain, V., Chadha, A., & Das, A. (2024). The what, why, and how of context length extension techniques in large language models -- A detailed survey. arXiv preprint arXiv:2401.07872.
  3. Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12, 157-173.
