
CreateData4AI (CD4AI)


CreateData4AI: Leveraging Domain Knowledge and Context Rules to Transform Large-Scale Unstructured Text Corpora into Structured and Annotated Datasets


Zettabytes of data are estimated to be generated every day, with about 80% of this data being unannotated, unstructured text. An as yet unsolved problem with this type of data is how to make it useful for AI applications. Manual annotation of the data can be very precise and can incorporate domain-specific knowledge, but it is costly, inefficient, and does not scale. The so-called "80/20 rule" refers to the fact that data scientists often spend up to 80% of their time sorting, cleaning, and otherwise preparing datasets. This project aims to develop a novel hybrid framework that helps domain experts annotate text using Natural Language Processing algorithms, reducing the process to a fraction of the time. The hybrid framework will enable data scientists to create customized, domain-specific datasets for their AI applications in a short time. In particular, small and medium-sized companies with only a few employees are thus supported in developing their own AI applications.

The proposed approach is structured as a pipeline of multiple sub-tasks, all of which will leverage modern Natural Language Processing techniques to infuse domain knowledge into the dataset creation process. Starting with a corpus of unstructured (text) documents, the goal is to create meaningful datasets with defined classes (features). To accomplish this, the steps are as follows:

  1. Keyword Extraction: starting from classes or tags defined by a domain expert, unsupervised keyword-extraction techniques are used to support the expert in defining each class. Moreover, related words and phrases are suggested to further refine the scope of the class.
  2. Context Window Extraction: building around the keywords and keyphrases, windows encapsulating the context of these key units of information are extracted. Such windows should best capture the meaning of the selected keyword in context, so that its function in the text can be determined (a minimal sketch of steps 1 and 2 follows this list).
  3. Context Rule Creation: the domain expert then reviews the extracted context windows and selects those that best describe the meaning of the predefined classes. The selected windows form a set of context rules that serves as the basis for automated dataset creation.
  4. Extrapolation: for each class, the set of context rules is combined with NLP techniques to "extrapolate" from the finite set of rules to a theoretically infinite number of unseen documents. This step bridges the gap between manually defined rules and fully unsupervised classification.
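
The following minimal sketch illustrates how steps 1 and 2 could look in practice. It uses TF-IDF scores as a simple stand-in for the keyword-extraction component and fixed-size token windows for context extraction; the example corpus, the chosen keyword, and the window size are illustrative assumptions and not part of the actual CD4AI implementation.

    # Illustrative sketch of steps 1-2: keyword suggestion and context window
    # extraction. TF-IDF and fixed-size token windows are simple stand-ins for
    # the project's actual NLP components; corpus and parameters are made up.
    import re
    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = [
        "The battery drains quickly and the charger overheats during use.",
        "Excellent battery life, the phone charges fast and stays cool.",
        "The delivery was late and the packaging arrived damaged.",
    ]

    # Step 1: unsupervised keyword suggestion to support the domain expert.
    vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
    tfidf = vectorizer.fit_transform(corpus)
    terms = vectorizer.get_feature_names_out()
    scores = tfidf.max(axis=0).toarray().ravel()
    top_terms = sorted(zip(terms, scores), key=lambda pair: -pair[1])[:5]
    print("Suggested keywords:", [term for term, _ in top_terms])

    # Step 2: fixed-size token windows around a keyword chosen by the expert.
    def context_windows(doc, keyword, size=3):
        tokens = re.findall(r"\w+", doc.lower())
        return [
            " ".join(tokens[max(0, i - size): i + size + 1])
            for i, token in enumerate(tokens)
            if token == keyword
        ]

    for doc in corpus:
        print(context_windows(doc, "battery"))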

As outlined above, the output of the proposed pipeline is a structured dataset mapping defined characteristics (classes) to individual documents.
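
For concreteness, the sketch below shows what such a dataset could look like once steps 3 and 4 are applied. Expert-approved context windows act as context rules, and TF-IDF cosine similarity serves as a placeholder extrapolation mechanism; the rules, class names, unseen documents, and similarity threshold are invented for illustration and do not reflect the project's final design.

    # Illustrative sketch of steps 3-4 and the resulting structured dataset.
    # Rules, documents, and the threshold below are invented for illustration.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Step 3: context rules selected by the domain expert, grouped per class.
    context_rules = {
        "battery_issue": ["battery drains quickly", "charger overheats during use"],
        "shipping_issue": ["delivery was late", "packaging arrived damaged"],
    }

    unseen_docs = [
        "After a week the battery drains completely overnight.",
        "The delivery of the parcel was two days late.",
    ]

    # Step 4: extrapolate from the finite rule set to unseen documents by
    # comparing each document against every rule of every class.
    all_rules = [rule for rules in context_rules.values() for rule in rules]
    vectorizer = TfidfVectorizer(stop_words="english")
    vectorizer.fit(all_rules + unseen_docs)

    dataset = []
    for doc in unseen_docs:
        row = {"document": doc}
        doc_vec = vectorizer.transform([doc])
        for cls, rules in context_rules.items():
            similarities = cosine_similarity(doc_vec, vectorizer.transform(rules))
            row[cls] = bool(similarities.max() > 0.2)  # illustrative threshold
        dataset.append(row)

    for row in dataset:
        print(row)  # one structured record per document with its class labels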

The goal of this project is to build upon early research into the use of this new method for creating data with the assistance of a domain expert. The main contribution will be the strengthening of each part of the pipeline by implementing state-of-the-art techniques, as well as by designing new improvements to them. In addition, the ultimate goal is to create an open-source, publicly usable website where the fruits of the project can be explored and further utilized.

To aid in the completion of the project, the following research questions have been defined:

In what way can current state-of-the-art Natural Language Processing techniques be augmented to incorporate specific domain knowledge, with the goal of transforming unstructured text to structured datasets?

  1. How can domain experts be supported in the definition of classes for characterizing large text corpora, particularly in the creation of keywords and keyphrases?
  2. Which NLP techniques are best suited for the extraction of coherent windows of context centered around predefined keywords?
  3. In which way can a set of context rules be most efficiently and accurately applied to large-scale text corpora?
  4. How can the accuracy and usability of the proposed pipeline be validated and evaluated?
  5. What is the best way to present the resulting research, such that users of varying backgrounds and expertise can utilize its capabilities?

This project is supported by the Bayerisches Staatsministerium für Wirtschaft, Landesentwicklung und Energie (StMWi), and is conducted in cooperation with Fusionbase GmbH (Munich, Germany).