With the rise of digitalization, information retrieval has to cope with ever-growing amounts of digitized content. Legal content providers invest heavily in building domain-specific ontologies such as thesauri in order to retrieve significantly more relevant documents. Label propagation is a family of graph-based semi-supervised machine learning algorithms. Since 2002, many label propagation methods have been developed, for example to identify groups of similar nodes in graphs. In this thesis, we test the suitability of label propagation methods for extending a thesaurus from the tax law domain. The graph on which label propagation operates is a similarity graph constructed from word embeddings. We cover the process from end to end and conduct several parameter studies to understand the impact of selected hyperparameters on overall performance. The results are then evaluated in manual studies and compared with a baseline approach.
This thesis is carried out in cooperation with Prof. Dr. Günnemann, who holds the Professorship of Data Mining and Analytics at the Chair for Database Systems (Datenbanksysteme) at TUM.
Keywords: Thesaurus Extension, Legal Tech, Information Retrieval, Label Propagation, Word Embeddings, Data Science, Machine Learning
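
The pipeline outlined in the abstract (a similarity graph built from word embeddings, followed by label propagation over that graph) can be illustrated with a minimal sketch. This is not the implementation used in the thesis: the toy 2-D vectors, the choice of a cosine k-nearest-neighbour graph, the clamped iterative propagation variant, and the function names `cosine_knn_graph` and `propagate_labels` are all assumptions made purely for illustration. In a thesaurus setting, the rows of `X` would be word embeddings and the class indices would stand for thesaurus concepts.

```python
import numpy as np

def cosine_knn_graph(X, k):
    """Symmetric k-nearest-neighbour graph weighted by cosine similarity."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    np.fill_diagonal(S, -np.inf)  # keep a node from selecting itself
    W = np.zeros_like(S)
    for i in range(len(X)):
        nbrs = np.argsort(S[i])[-k:]  # indices of the k most similar nodes
        W[i, nbrs] = np.clip(S[i, nbrs], 0.0, None)  # drop negative weights
    return np.maximum(W, W.T)  # symmetrise

def propagate_labels(W, y, n_classes, n_iter=100):
    """Clamped label propagation; y holds a class index for seeds, -1 otherwise."""
    idx = np.flatnonzero(y >= 0)
    F = np.zeros((len(y), n_classes))
    F[idx, y[idx]] = 1.0
    seeds = F[idx].copy()
    deg = W.sum(axis=1, keepdims=True)
    P = W / np.where(deg > 0, deg, 1.0)  # row-normalised transition matrix
    for _ in range(n_iter):
        F = P @ F        # each node averages its neighbours' label scores
        F[idx] = seeds   # clamp the known (seed) labels
    return F.argmax(axis=1)

# Toy "embeddings" (hypothetical): two groups of three vectors each.
X = np.array([[1.0, 0.1], [0.9, 0.2], [1.0, 0.3],
              [0.1, 1.0], [0.2, 0.9], [0.3, 1.0]])
y = np.array([0, -1, -1, 1, -1, -1])  # one seed label per group

W = cosine_knn_graph(X, k=2)
pred = propagate_labels(W, y, n_classes=2)
print(pred)
```

Under these assumptions, the neighbourhood size `k` and the number of iterations `n_iter` are examples of the kind of hyperparameters whose impact a parameter study would examine.
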
Name | Type | Size | Last Modification | Last Editor |
---|---|---|---|---|
180604 Mueller Label Propagation Thesaurus Extension MA Kick-off.pdf | PDF | 835 KB | 04.06.2018 | Markus Müller |
180604 Mueller Label Propagation Thesaurus Extension MA Kick-off.pptx | PPTX | 9,76 MB | 04.06.2018 | Markus Müller |
181107 Mueller Label Propagation Thesaurus Extension MA Thesis.pdf | PDF | 4,68 MB | 07.11.2018 | Markus Müller |
181109 Mueller Label Propagation Thesaurus Extension MA Final.pdf | PDF | 8,98 MB | 09.11.2018 | Markus Müller |
181109 Mueller Label Propagation Thesaurus Extension MA Final.pptx | PPTX | 8,81 MB | 09.11.2018 | Markus Müller |