University of Twente Student Theses
Effects of inserting domain vocabulary and fine-tuning BERT for German legal language
Yeung, Chin Man (2019) Effects of inserting domain vocabulary and fine-tuning BERT for German legal language.
Full text: PDF (7 MB)
Abstract: In this study we explore the effects of domain adaptation in NLP using the state-of-the-art pre-trained language model BERT. Starting from its German pre-trained version and a dataset from OpenLegalData containing over 100,000 German court decisions, we fine-tuned the language model and inserted legal domain vocabulary to create a German Legal BERT model. We evaluate this model on downstream tasks including classification, regression, and similarity. For each task, we compare simple yet robust machine learning methods such as TF-IDF and FastText against different BERT models, namely Multilingual BERT, German BERT, and our fine-tuned German Legal BERT. On the classification task, all models performed equally well. On the regression task, our German Legal BERT slightly improved over FastText and the other BERT models, but was still considerably outperformed by TF-IDF. In a within-subject study (N=16), we asked participants to evaluate the relevance of documents retrieved by similarity to a reference case law; our findings indicate that German Legal BERT was, to a small degree, better able to capture legal information for this comparison. Overall, we observed that further fine-tuning a BERT model on the legal domain yields only marginal performance gains when the pre-trained language model already included legal data.
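The TF-IDF similarity baseline the abstract refers to can be sketched in plain Python. This is a minimal illustration, not the thesis code: the tokenized "court decisions" and the German legal terms are invented for the example, and a real pipeline would use a proper tokenizer and a library implementation such as scikit-learn's TfidfVectorizer.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute a sparse TF-IDF vector (dict: term -> weight) per tokenized document."""
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: (tf[t] / len(doc)) * idf[t] for t in tf})
    return vecs

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy example: three "decisions"; the first two share tenancy-law vocabulary.
docs = [
    ["kündigung", "miete", "vertrag"],
    ["kündigung", "miete", "frist"],
    ["steuer", "einkommen", "frist"],
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```

Retrieval by similarity then amounts to ranking all candidate decisions by their cosine score against the reference case law's vector, which is the setup the human evaluation in the study compares against the BERT-based embeddings.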
Item Type: Essay (Master)
Faculty: EEMCS: Electrical Engineering, Mathematics and Computer Science
Subject: 54 computer science
Programme: Interaction Technology MSc (60030)
Link to this item: https://purl.utwente.nl/essays/80128