Institute for Web Science and Technologies
Filtern
Erscheinungsjahr
- 2018 (9) (entfernen)
Dokumenttyp
- Masterarbeit (7)
- Bachelorarbeit (1)
- Dissertation (1)
Schlagworte
The purpose of this thesis is to explore the sentiment distributions of Wikipedia concepts.
We analyse the sentiment of the entire English Wikipedia corpus, which includes 5,669,867 articles and 1,906,375 talks, by using a lexicon-based method with four different lexicons.
Also, we explore the sentiment distributions from a time perspective using the sentiment scores obtained from our selected corpus. The results obtained have been compared not only between articles and talks but also among four lexicons: OL, MPQA, LIWC, and ANEW.
Our findings show that among the four lexicons, MPQA has the highest sensitivity and ANEW has the lowest sensitivity to emotional expressions. Wikipedia articles show more sentiments than talks according to OL, MPQA, and LIWC, whereas Wikipedia talks show more sentiments than articles according to ANEW. Besides, the sentiment has a trend regarding time series, and each lexicon has its own bias regarding text describing different things.
Moreover, our research provides three interactive widgets for visualising sentiment distributions for Wikipedia concepts regarding the time and geolocation attributes of concepts.
Navigation is a natural way to explore and discover content in a digital environment. Hence, providers of online information systems such as Wikipedia---a free online encyclopedia---are interested in providing navigational support to their users. To this end, an essential task approached in this thesis is the analysis and modeling of navigational user behavior in information networks with the goal of paving the way for the improvement and maintenance of web-based systems. Using large-scale log data from Wikipedia, this thesis first studies information access by contrasting search and navigation as the two main information access paradigms on the Web. Second, this thesis validates and builds upon existing navigational hypotheses to introduce an adaptation of the well-known PageRank algorithm. This adaptation is an improvement of the standard PageRank random surfer navigation model that results in a more "reasonable surfer" by accounting for the visual position of links, the information network regions they lead to, and the textual similarity between the link source and target articles. Finally, using agent-based simulations, this thesis compares user models that have a different knowledge of the network topology in order to investigate the amount and type of network topological information needed for efficient navigation. An evaluation of agents' success on four different networks reveals that in order to navigate efficiently, users require only a small amount of high-quality knowledge of the network topology. Aside from the direct benefits to content ranking provided by the "reasonable surfer" version of PageRank, the empirical insights presented in this thesis may also have an impact on system design decisions and Wikipedia editor guidelines, i.e., for link placement and webpage layout.
Wikipedia is the biggest, free online encyclopaedia that can be expanded by any-one. For the users, who create content on a specific Wikipedia language edition, a social network exists. In this social network users are categorised into different roles. These are normal users, administrators and functional bots. Within the networks, a user can post reviews, suggestions or send simple messages to the "talk page" of another user. Each language in the Wikipedia domain has this type of social network.
In this thesis characteristics of the three different roles are analysed in order to learn how they function in one language network of Wikipedia and apply them to another Wikipedia network to identify bots. Timestamps from created posts are analysed to reveal noticeable characteristics referring to continuous messages, message rates and irregular behaviour of a user are discovered. Through this process we show that there exist differences between the roles for the mentioned characteristics.
We examine the systematic underrecognition of female scientists (Matilda effect) by exploring the citation network of papers published in the American Physical Society (APS) journals. Our analysis shows that articles written by men (first author, last author and dominant gender of authors) receive more citations than similar articles written by women (first author, last author and dominant gender of authors) after controlling for the journal of publication, year of publication and content of the publication. Statistical significance of the overlap between the lists of references was considered as the measure of similarity between articles in our analysis. In addition, we found that men are less likely to cite articles written by women and women are less likely to cite articles written by men. This pattern leads to receiving more citations by articles written by men than similar articles written by women because the majority of authors who published in APS journals are male (85%). We also observed Matilda effect reduces when articles are published in journals with the highest impact factors. In other words, people’s evaluation of articles published in these journals is not affected by the gender of authors significantly. Finally, we suggested a method that can be applied by editors in academic journals to reduce the evaluation bias to some extent. Editors can identify missing citations using our proposed method to complete bibliographies. This policy can reduce the evaluation bias because we observed papers written by female scholars (first author, last author, the dominant gender of authors) miss more citations than articles written by male scholars (first author, last author, the dominant gender of authors).
Ontologien sind wichtige Werkzeuge zur Wissensrepräsentation und elementare Bausteine des Semantic Web. Sie sind jedoch nicht statisch und können sich über die Zeit verändern. Die Gründe hierfür sind vielfältig: Konzepte innerhalb einer Ontologie können fehlerhaft modelliert worden sein, die von der Ontologie repräsentierte Domäne kann sich verändern oder eine Ontologie kann wiederverwendet werden und muss an den neuen Kontext angepasst oder mit bestehenden Ontologien verbunden werden. Die Schwierigkeit dieses Prozesses hat zur Entstehung des Forschungsfeldes der Ontology Change geführt. Das Entfernen von Wissen aus Ontologien ist ein wichtiger Aspekt dieses Änderungsprozesses, da selbst das Hinzufügen neuen Wissens zu einer Ontologie das Entfernen bestehenden Wissens notwendig machen kann, falls dieses mit den neuen Vorstellungen in Konflikt steht. Dieses Entfernen muss jedoch wohldurchdacht sein, da das Ändern bestehender Konzepte leicht zu viel Wissen aus der Ontologie entfernen oder die semantische Bedeutung der Konzepte auf eine potenziell unerwartete Weise verändern kann. In dieser Arbeit wird daher ein formaler Operator zum präzisen Entfernen von Wissen aus Konzepten vorgestellt. Dieser basiert auf der Beschreibungslogik EL und baut partiell auf den Postulaten für Belief Set und Belief Base Contraction sowie der Arbeit von Suchanek et al. auf. Hierfür wird zunächst ein Einstieg in das Thema Ontologien und die Ontologiesprache OWL 2 gegeben und das Problemfeld der Ontology Change wird erläutert. Es wird dann gezeigt, wie ein formaler Operator diesen Prozess unterstützen kann und weshalb die Beschreibungslogik EL einen guten Ausgangspunkt für die Entwicklung eines solchen Operators darstellt. Anschließend wird ein Einblick in das Feld der Beschreibungslogiken gegeben. Hierfür wird die Geschichte der Beschreibungslogik kurz umrissen, Anwendungsgebiete werden genannt und es werden Standardprobleme in dieser Logik erläutert. In diesem Zusammenhang wird die Beschreibungslogik EL formal eingeführt. In einem nächsten Schritt werden verwandte Arbeiten untersucht und es wird gezeigt, warum das Recovery- und Relevance-Postulat für das Entfernen von Wissen aus Konzepten nicht unmittelbar anwendbar ist. Die hier gewonnenen Erkenntnisse werden anschließend dazu genutzt, die Anforderungen an den Operator zu formalisieren. Diese basieren hauptsächlich auf den Postulaten für Belief Set und Belief Base Contraction. Zusätzlich werden weitere Eigenschaften formuliert welche den Verlust des Recovery- bzw. Relevance-Postulates ausgleichen sollen. In einem nächsten Schritt wird der Operator definiert und es wird gezeigt, dass diese Definition das präzise Entfernen von Wissen aus EL-Konzepten gestattet. Mittels formaler Beweise wird zudem gezeigt, dass diese Definition alle zuvor aufgestellten Anforderungen erfüllt. In einem weiteren Beispiel wird dargestellt, wie der Operator in Verbindung mit sogenannten Laconic Justifications verwendet werden kann, um einen menschlichen Ontology-Editor durch das automatisierte Entfernen von unerwünschten Konsequenzen aus der Ontologie zu unterstützen. Aufbauend auf Algorithmen, welche aus der formalen Definition des Operators abgeleitet wurden, wird ein Plugin zum Entfernen von Wissen aus Ontologien für den Ontology-Editor Protégé vorgestellt. Anschließend werden die bisherigen Erkenntnisse zusammengefasst und es wird ein Fazit gezogen. Die Arbeit schließt mit einem Ausblick über mögliche zukünftige Forschung.
Knowledge-based authentication methods are vulnerable to Shoulder surfing phenomenon.
The widespread usage of these methods and not addressing the limitations it has could result in the user’s information to be compromised. User authentication method ought to be effortless to use and efficient, nevertheless secure.
The problem that we face concerning the security of PIN (Personal Identification Number) or password entry is shoulder surfing, in which a direct or indirect malicious observer could identify the user sensitive information. To tackle this issue we present TouchGaze which combines gaze signals and touch capabilities, as an input method for entering user’s credentials. Gaze signals will be primarily used to enhance targeting and touch for selecting. In this work, we have designed three different PIN entry method which they all have similar interfaces. For the evaluation, these methods were compared based on efficiency, accuracy, and usability. The results uncovered that despite the fact that gaze-based methods require extra time for the user to get familiar with yet it is considered more secure. In regards to efficiency, it has the similar error margin to the traditional PIN entry methods.
This Master Thesis is an exploratory research to determine whether it is feasible to construct a subjectivity lexicon using Wikipedia. The key hypothesis is that that all quotes in Wikipedia are subjective and all regular text are objective. The degree of subjectivity of a word, also known as ''Quote Score'' is determined based on the ratio of word frequency in quotations to its frequency outside quotations. The proportion of words in the English Wikipedia which are within quotations is found to be much smaller as compared to those which are not in quotes, resulting in a right-skewed distribution and low mean value of Quote Scores.
The methodology used to generate the subjectivity lexicon from text corpus in English Wikipedia is designed in such a way that it can be scaled and reused to produce similar subjectivity lexica of other languages. This is achieved by abstaining from domain and language-specific methods, apart from using only readily-available English dictionary packages to detect and exclude stopwords and non-English words in the Wikipedia text corpus.
The subjectivity lexicon generated from English Wikipedia is compared against other lexica; namely MPQA and SentiWordNet. It is found that words which are strongly subjective tend to have high Quote Scores in the subjectivity lexicon generated from English Wikipedia. There is a large observable difference between distribution of Quote Scores for words classified as strongly subjective versus distribution of Quote Scores for words classified as weakly subjective and objective. However, weakly subjective and objective words cannot be differentiated clearly based on Quote Score. In addition to that, a questionnaire is commissioned as an exploratory approach to investigate whether subjectivity lexicon generated from Wikipedia could be used to extend the coverage of words of existing lexica.
The content aggregator platform Reddit has established itself as one of the most popular websites in the world. However, scientific research on Reddit is hindered as Reddit allows (and even encourages) user anonymity, i.e., user profiles do not contain personal information such as the gender. Inferring the gender of users in large-scale could enable the analysis of gender-specific areas of interest, reactions to events, and behavioral patterns. In this direction, this thesis suggests a machine learning approach of estimating the gender of Reddit users. By exploiting specific conventions in parts of the website, we obtain a ground truth for more than 190 million comments of labeled users. This data is then used to train machine learning classifiers to use them to gain insights about the gender balance of particular subreddits and the platform in general. By comparing a variety of different approaches for classification algorithm, we find that character-level convolutional neural network achieves performance with an 82.3% F1 score on a task of predicting a gender of a user based on his/her comments. The score surpasses 85% mark for frequent users with more than 50 comments. Furthermore, we discover that female users are less active on Reddit platform, they write fewer comments and post in fewer subreddits on average, when compared to male users.
Topic Models sind ein beliebtes Werkzeug um Themen in großen Textkorpora zu identifizieren. Diese Textkorpora enthalten oft versteckte Meta-Gruppen. Das Größenverhältnis zwischen diesen Gruppen variiert meist stark. Die Präsenz dieser Gruppen wird in der Praxis oft ignoriert. Diese Masterarbeit erforscht daher, ob diese Gruppen Einfluss auf ein Topic Model haben.
Um den Einfluss zu testen, wird LDA auf Samples mit unterschiedlichen Gruppengrößen trainiert. Die Samples werden von Textkorpora mit großen Gruppenunterschieden (d.h. Sprachunterschieden) und kleinen Gruppenunterschieden (d.h. Unterschiede in der politische Orientierung) generiert. Die Leistungsfähigkeit von LDA wird per "Perplexity" evaluiert.
Der Einfluss von Gruppen auf die generelle Leistungsfähigkeit von Topic Models hängt von verschiedenen Faktoren der Gruppen ab, z.B. der Vorhersagbarkeit der Sprache generell. Die Leistungsfähigkeit der Topic Models für die einzelnen Gruppen wird von der Variation der relativen Gruppengrößen beeinflusst. Allerdings ist der Effekt für alle Datensätze verschieden.
LDA kann die Gruppen intern unterscheiden, wenn die Unterschiede der Gruppen groß genug sind (z.B. Sprachunterschiede). Der Anteil der Topics, die explizit für eine Gruppe gelernt werden, ist jedoch unterproportional zu dem Anteil der Gruppe im Trainingskorpus. Dieser Effekt verstärkt sich für kleinere Minderheiten.