Subjectivity Detection through Quotation Analysis in Wikipedia

  • This Master Thesis is an exploratory research to determine whether it is feasible to construct a subjectivity lexicon using Wikipedia. The key hypothesis is that that all quotes in Wikipedia are subjective and all regular text are objective. The degree of subjectivity of a word, also known as ''Quote Score'' is determined based on the ratio of word frequency in quotations to its frequency outside quotations. The proportion of words in the English Wikipedia which are within quotations is found to be much smaller as compared to those which are not in quotes, resulting in a right-skewed distribution and low mean value of Quote Scores. The methodology used to generate the subjectivity lexicon from text corpus in English Wikipedia is designed in such a way that it can be scaled and reused to produce similar subjectivity lexica of other languages. This is achieved by abstaining from domain and language-specific methods, apart from using only readily-available English dictionary packages to detect and exclude stopwords and non-English words in the Wikipedia text corpus. The subjectivity lexicon generated from English Wikipedia is compared against other lexica; namely MPQA and SentiWordNet. It is found that words which are strongly subjective tend to have high Quote Scores in the subjectivity lexicon generated from English Wikipedia. There is a large observable difference between distribution of Quote Scores for words classified as strongly subjective versus distribution of Quote Scores for words classified as weakly subjective and objective. However, weakly subjective and objective words cannot be differentiated clearly based on Quote Score. In addition to that, a questionnaire is commissioned as an exploratory approach to investigate whether subjectivity lexicon generated from Wikipedia could be used to extend the coverage of words of existing lexica.

Download full text files

Export metadata

Additional Services

Share in Twitter Search Google Scholar
Author:Pang Wayne Wee
Referee:Claudia Wagner, Fabian Flöck
Document Type:Master's Thesis
Date of completion:2018/05/29
Date of publication:2018/06/29
Publishing institution:Universität Koblenz, Universitätsbibliothek
Granting institution:Universität Koblenz, Fachbereich 4
Date of final exam:2018/06/19
Release Date:2018/06/29
Number of pages:ix, 36
Institutes:Fachbereich 4 / Institute for Web Science and Technologies
Licence (German):License LogoEs gilt das deutsche Urheberrecht: § 53 UrhG