German Legal Information Retrieval & Query Expansion with Word Embeddings

Feb 13, 2020


Legal research is an essential part of many legal experts' daily work [La15]. Much AI & Law research focuses on technical support for legal research, i.e. legal information retrieval [Op17]. Legal information retrieval differs from general information retrieval in several respects, for example different document types, large corpora, a broad range of audiences and a specialized language that makes heavy use of synonymy, cf. [Op17].

Word embeddings are a technology that represents words and other discrete text units as continuous vectors. Since Mikolov et al. presented efficient methods to calculate word embeddings [Mi13a], research has shifted significantly towards word-embedding-based representations of words and text. Word embeddings have intriguing characteristics, for example linear regularities among word vectors, but also a tendency for synonymous words to be close together in the embedding space [Mi13b].

Selected publications in the history of word embeddings development

Query expansion is a frequently used approach in the legal domain to cope with the synonymy of legal language, see for example [Sc07]. In this research project, we investigate the potential of word embeddings to improve legal information retrieval through query expansion, in particular for the German legal domain.


Thesaurus Extension (Explicit Query Expansion)

Thesauri can be considered lightweight ontologies that contain, for example, groups of synonymous terms (synsets). In legal information retrieval, thesauri are used for query expansion in a controlled fashion. Because synonymous words tend to be close in the embedding space, word embeddings can be used to identify synonyms of particular words or synsets and thereby to extend existing thesauri. We investigate different word embedding technologies such as word2vec [Mi13a], FastText [Bo17] and GloVe [Pe14], as well as different approaches to calculating synset embeddings, which can be used with cosine similarity to identify new candidates for inclusion in synsets. Parts of this research are conducted in cooperation with Datev eG on a German tax law corpus and thesaurus.

Thesaurus Extension with Vanilla Synset Embeddings Approach
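
The idea of ranking candidate terms against a synset embedding can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes that a synset embedding is simply the mean of its members' word vectors, and the three-dimensional vectors, the example terms and the function names (synset_embedding, rank_candidates) are hypothetical toy values chosen for readability. In practice the vectors would come from a model such as word2vec, FastText or GloVe trained on a legal corpus.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def synset_embedding(synset, vectors):
    """Assumed 'vanilla' synset embedding: the mean of the member word vectors."""
    members = [vectors[w] for w in synset if w in vectors]
    dim = len(members[0])
    return [sum(vec[i] for vec in members) / len(members) for i in range(dim)]

def rank_candidates(synset, vectors, top_k=3):
    """Rank all vocabulary words outside the synset by similarity to its embedding."""
    centre = synset_embedding(synset, vectors)
    scored = [(w, cosine(centre, v)) for w, v in vectors.items() if w not in synset]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical toy vectors; real embeddings have hundreds of dimensions.
TOY_VECTORS = {
    "Miete": [0.9, 0.1, 0.0],
    "Mietzins": [0.8, 0.2, 0.1],
    "Mietpreis": [0.7, 0.3, 0.1],
    "Steuer": [0.1, 0.9, 0.2],
    "Abgabe": [0.2, 0.8, 0.3],
}

candidates = rank_candidates({"Miete", "Mietzins"}, TOY_VECTORS, top_k=2)
for word, score in candidates:
    print(f"{word}: {score:.3f}")
```

With these toy values, the term closest to the existing synset is ranked first. In a controlled thesaurus workflow, such a ranked list would typically be reviewed by an editor rather than merged automatically.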


Implicit Query Expansion

Due to the linear regularities of word embeddings, the word embeddings of a text can be accumulated (for example summed) to represent that text. Accumulated word embeddings can be seen as an alternative to traditional term-frequency-based document representations such as term frequency-inverse document frequency (TF-IDF). A working hypothesis is that word-embedding-based text representations implicitly conduct query expansion, because synonymous terms lie close together in the embedding space.
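
The working hypothesis can be illustrated with a small sketch. It assumes that "accumulating" means summing the word vectors of a text, and it uses hypothetical three-dimensional toy vectors and made-up token lists. The point it shows is that a query can score well against a document with which it shares no term, whereas a purely term-based comparison finds no overlap.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def text_embedding(tokens, vectors):
    """Sum the vectors of all known tokens; unknown tokens are skipped."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for token in tokens:
        vec = vectors.get(token)
        if vec is not None:
            total = [a + b for a, b in zip(total, vec)]
    return total

def term_overlap(query_tokens, doc_tokens):
    """Number of distinct terms shared by query and document."""
    return len(set(query_tokens) & set(doc_tokens))

# Hypothetical toy vectors; "mietzins" and "miete" are placed close together.
TOY_VECTORS = {
    "mietzins": [0.8, 0.2, 0.1],
    "miete": [0.9, 0.1, 0.0],
    "vertrag": [0.3, 0.1, 0.9],
    "steuer": [0.1, 0.9, 0.2],
    "bescheid": [0.0, 0.8, 0.4],
}

query = ["mietzins"]
doc_rent = ["miete", "vertrag"]
doc_tax = ["steuer", "bescheid"]

query_vec = text_embedding(query, TOY_VECTORS)
sim_rent = cosine(query_vec, text_embedding(doc_rent, TOY_VECTORS))
sim_tax = cosine(query_vec, text_embedding(doc_tax, TOY_VECTORS))

print("term overlap with rent document:", term_overlap(query, doc_rent))
print(f"embedding similarity, rent document: {sim_rent:.3f}")
print(f"embedding similarity, tax document:  {sim_tax:.3f}")
```

In this toy setting the rent document ranks above the tax document although neither contains the query term, which is the effect that an explicit expansion of the query with synonyms would otherwise have to produce.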


Semantic Text Matching

In parts of this research project, we focus on problems that can be described as Semantic Text Matching (STM) problems. STM is the identification of semantically and/or logically related text fragments among different documents. Citation network analysis focuses on the identification of explicit links within or among documents. By analogy, STM can be seen as identifying implicit links within or among documents. STM problems often occur in argumentation mining [Mo11], and word embeddings have been investigated as a possible solution, for example by [Ri15] and [Na15]. STM is also related to, but different from, textual entailment, see for example [Ad16].

We study a particular use case in German tenancy law, where contract paragraphs are matched against chapters of legal commentaries. To some degree, this can also be seen as a legal information retrieval task. Parts of this research are conducted in cooperation with Haufe Group. A traditional search method in the legal domain is keyword search, see for example [Pe05]. In this research project, we explore Selection Search (users select text in an existing document as input to a search query) and traditional keyword search, both integrated into popular text processing tools in the (German) legal domain, as potentially suitable human-computer interaction methods that could benefit from implicit query expansion.



[La19a] Landthaler, J.; Glaser, I.; Lecker, H.; Matthes, F.: User Study on Selection Search and Semantic Text Matching in German Tenancy Law, in: Weblaw, Jusletter IT, 21 February 2019
[La19a] Landthaler, J.; Glaser, I.; Lecker, H.; Matthes, F.: User Study on Selection Search and Semantic Text Matching in German Tenancy Law, IRIS: Internationales Rechtsinformatik Symposium, Salzburg, Austria, 2019
[La18c] Landthaler, J.; Glaser, I.; Matthes, F.: Explainable Semantic Text Matching, Jurix: International Conference on Legal Knowledge and Information Systems, Groningen, Netherlands (to appear)
[La18a] Landthaler, J.; Scepankova, E.; Glaser, I.; Lecker, H.; Matthes, F.: Semantic Text Matching of Contract Clauses and Legal Comments in Tenancy Law, IRIS: Internationales Rechtsinformatik Symposium, Salzburg, Austria, 2018
[La17a] Landthaler, J.; Waltl, B.; Huth, D.; Braun, D.; Stocker, C.; Geiger, T.; Matthes, F.: Extending Thesauri Using Word Embeddings and the Intersection Method, Proc. of 2nd Workshop on Automated Semantic Analysis of Information in Legal Texts (ASAIL'17), London, UK, June 16, 2017
[La16c] Landthaler, J.; Waltl, B.; Holl, P.; Matthes, F.: Extending Full Text Search for Legal Document Collections using Word Embeddings, Jurix: International Conference on Legal Knowledge and Information Systems, Sophia Antipolis, France, 2016




