
Guided Research: Rajna Fani


A Human Assessment of Reference-Free and Reference-Based Evaluation Approaches in the HR Domain

 

Abstract and Motivation

In the era of Large Language Models (LLMs), assessing the quality of generated text remains an open challenge. This study explores the effectiveness of reference-free metrics for evaluating text produced by advanced language models and compares them with traditional reference-based evaluation methods.

The research has a direct practical application in SAP HR chatbots, which aim to address the prolonged waiting times employees face when seeking information from the Human Resources department. By harnessing advanced text generation models, such conversational agents have the potential to expedite responses and reduce the HR department's workload.

Moreover, the study examines the reliability of reference-free evaluation metrics relative to traditional reference-based metrics, and it assesses how well automatic metrics agree with human evaluation by domain experts. Two approaches, a fine-tuned language model (LM) approach and an LLM-powered approach, are evaluated on a question-answering dataset that combines FAQs and user utterances from chatbot logs.
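
To make the distinction concrete, the following minimal Python sketch contrasts the two families of metrics. It assumes the rouge_score package for the reference-based side and a hypothetical query_llm helper standing in for an LLM judge on the reference-free side; neither is prescribed by the study itself.

from rouge_score import rouge_scorer


def reference_based_score(prediction: str, reference: str) -> float:
    # Reference-based: compare the generated answer to a gold answer (ROUGE-L F1).
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    return scorer.score(reference, prediction)["rougeL"].fmeasure


def reference_free_score(question: str, prediction: str, query_llm) -> float:
    # Reference-free: ask an LLM judge to rate the answer without any gold reference.
    # query_llm is an assumed stand-in for whatever LLM endpoint serves as the judge.
    prompt = (
        "Rate the following answer to an HR question from 1 (poor) to 5 (excellent), "
        "considering relevance, correctness, and fluency.\n"
        f"Question: {question}\nAnswer: {prediction}\nScore:"
    )
    return float(query_llm(prompt))


prediction = "You can request parental leave through the employee self-service portal."
reference = "Parental leave requests are submitted via the employee self-service portal."
print(reference_based_score(prediction, reference))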

Research Questions

1. What are the emerging state-of-the-art metrics for evaluating generative conversational agents, and how do they compare to traditional metrics?

2. Are reference-free evaluation metrics, especially those leveraging advanced language models, a more reliable indicator of a generative model's performance than traditional reference-based metrics?

3. How well do automatic metrics align with human evaluation by domain experts when assessing generative model performance?
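
For the third question, agreement between automatic metrics and expert judgments is commonly quantified with a rank correlation. The sketch below uses Spearman correlation from SciPy on invented placeholder scores; it illustrates the idea only and is not the study's actual analysis.

from scipy.stats import spearmanr

# Hypothetical per-answer scores for the same set of generated answers.
automatic_scores = [0.42, 0.65, 0.30, 0.78, 0.55]  # e.g. ROUGE-L or an LLM-judge score
human_ratings = [2, 4, 1, 5, 3]                    # expert ratings on a 1-5 scale

correlation, p_value = spearmanr(automatic_scores, human_ratings)
print(f"Spearman correlation: {correlation:.2f} (p={p_value:.3f})")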

 
