University of Twente Student Theses


Improving text representations for NLP from bags to strings of words

Schoot Uiterkamp, L. (2019) Improving text representations for NLP from bags to strings of words.

Abstract: When processing natural language, humans derive meaning from words using contextual information and word relations. We do this automatically, but it is very difficult for computers to achieve. In this thesis, a machine learning algorithm is used to estimate whether reviews of movies and TV shows taken from IMDb give a positive or negative appraisal of their respective movie or show. In the currently standard method of representing these reviews for use by machine learning models, the bag of words, the original arrangement of the words in the review is lost, because the representation is ordered according to a predefined word list. This limits the use of contextual information and word relations, which forms a barrier to interpreting what was meant. An alternative to the bag of words text representation is presented: the ‘string of words’ representation, which represents a text in terms of its original words, in their original order. It is tested against the bag of words representation in a neural network. To compare the two representations, performance and time measures were taken. The impact of the representation length of the string of words representation and of the lengths of the classified reviews was assessed in an exploratory analysis. The string of words representation outperforms the bag of words representation on time measures as well as performance measures, but it comes with its own limitations. It performs best with texts that deviate little in length from the training texts, and it offers an advantage over the bag of words representation only if the text length is shorter than the number of words known to the machine learning model.
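As a rough illustration of the contrast described in the abstract, the following Python sketch encodes two short texts both ways. It is not taken from the thesis: the function names, the whitespace tokenisation, the use of id 0 for padding and unknown words, and the fixed `length` parameter are all assumptions made for this example.

```python
from collections import Counter


def build_vocab(texts):
    """Assign each distinct word an integer id, starting at 1.

    Id 0 is reserved (an assumption of this sketch) for padding and
    for words that are not in the vocabulary.
    """
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def bag_of_words(text, vocab):
    """Return word counts ordered by the vocabulary.

    The vector has one slot per known word, so the order in which
    the words appeared in the text is lost.
    """
    counts = Counter(text.lower().split())
    vector = [0] * len(vocab)
    for word, count in counts.items():
        if word in vocab:
            vector[vocab[word] - 1] = count
    return vector


def string_of_words(text, vocab, length):
    """Return the word ids in their original order.

    The sequence is truncated or padded with 0 to a fixed length,
    so the vector has one slot per word position.
    """
    ids = [vocab.get(word, 0) for word in text.lower().split()]
    ids = ids[:length]
    return ids + [0] * (length - len(ids))


# Two hypothetical reviews with the same words in a different order.
review_a = "not good but bad"
review_b = "not bad but good"
vocab = build_vocab([review_a, review_b])

bag_a = bag_of_words(review_a, vocab)
bag_b = bag_of_words(review_b, vocab)
string_a = string_of_words(review_a, vocab, length=6)
string_b = string_of_words(review_b, vocab, length=6)
```

In this sketch the two reviews produce identical bag of words vectors, while their string of words vectors differ, which is the loss of word order the abstract refers to. The bag vector has as many entries as the vocabulary has words, whereas the string vector has as many entries as the chosen representation length.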
Item Type: Essay (Master)
Faculty: BMS: Behavioural, Management and Social Sciences
Subject: 17 linguistics and theory of literature, 18 languages and literature, 54 computer science, 77 psychology
Programme: Psychology MSc (66604)
Link to this item: https://purl.utwente.nl/essays/79245

 
