University of Twente Student Theses

Effects of inserting domain vocabulary and fine-tuning BERT for German legal language

Yeung, Chin Man (2019) Effects of inserting domain vocabulary and fine-tuning BERT for German legal language.

Full text: PDF (7MB)
Abstract: In this study, we explore the effects of domain adaptation in NLP using the state-of-the-art pre-trained language model BERT. Using its German pre-trained version and a dataset from OpenLegalData containing over 100,000 German court decisions, we fine-tuned the language model and inserted legal domain vocabulary to create a German Legal BERT model. We evaluate the performance of this model on downstream tasks, including classification, regression, and similarity. For each task, we compare simple yet robust machine learning methods such as TF-IDF and FastText against different BERT models, namely Multilingual BERT, German BERT, and our fine-tuned German Legal BERT. For the classification task, the reported results reveal that all models performed equally well. For the regression task, our German Legal BERT model slightly improved over FastText and the other BERT models, but it was still considerably outperformed by TF-IDF. In a within-subject study (N=16), we asked subjects to evaluate the relevance of documents retrieved by similarity to a reference case law. Our findings indicate that German Legal BERT was able, to a small degree, to better capture legal information for comparison. We observed that further fine-tuning a BERT model on the legal domain yields marginal gains in performance when the pre-trained language model already included legal data.
Item Type: Essay (Master)
Faculty: EEMCS: Electrical Engineering, Mathematics and Computer Science
Subject: 54 computer science
Programme: Interaction Technology MSc (60030)
Link to this item: https://purl.utwente.nl/essays/80128
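
The two adaptation steps the abstract describes, inserting legal domain vocabulary into the tokenizer and then continuing masked-language-model pre-training, can be illustrated with a minimal sketch. This is not the thesis's actual code; it assumes the HuggingFace transformers library and the public bert-base-german-cased checkpoint, and the legal terms listed are hypothetical examples, not the vocabulary inserted in the thesis.

    # Minimal sketch of domain adaptation: vocabulary insertion + MLM fine-tuning.
    # Assumptions: HuggingFace transformers, the public bert-base-german-cased
    # checkpoint, and illustrative (hypothetical) legal terms.
    from transformers import (
        AutoTokenizer,
        AutoModelForMaskedLM,
        DataCollatorForLanguageModeling,
    )

    tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-german-cased")

    # Step 1: insert domain vocabulary so frequent legal terms are kept as
    # single tokens instead of being split into many subword pieces.
    legal_terms = ["Vorinstanz", "Revisionskläger", "Rechtsmittelbelehrung"]
    num_added = tokenizer.add_tokens(legal_terms)

    # New tokens need embedding rows; these start randomly initialised and
    # are learned during the fine-tuning step below.
    model.resize_token_embeddings(len(tokenizer))
    print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}")

    # Step 2: continue pre-training on the court-decision corpus with the
    # standard masked-language-model objective (15% of tokens masked),
    # e.g. by passing this collator and a tokenised dataset to the Trainer API.
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

Fine-tuning then proceeds by feeding tokenised court decisions through the trainer with this collator; the resulting encoder can be reused for the downstream classification, regression, and similarity experiments.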