Refine
Year of publication
- 2020 (2) (remove)
Document Type
- Bachelor Thesis (1)
- Doctoral Thesis (1)
Language
- English (2) (remove)
Keywords
Institute
- Institute for Web Science and Technologies (2) (remove)
Social media platforms such as Twitter or Reddit allow users almost unrestricted access to publish their opinions on recent events or discuss trending topics. While the majority of users approach these platforms innocently, some groups have set their mind on spreading misinformation and influencing or manipulating public opinion. These groups disguise as native users from various countries to spread frequently manufactured articles, strong polarizing opinions in the political spectrum and possibly become providers of hate-speech or extremely political positions. This thesis aims to implement an AutoML pipeline for identifying second language speakers from English social media texts. We investigate style differences of text in different topics and across the platforms Reddit and Twitter, and analyse linguistic features. We employ feature-based models with datasets from Reddit, which include mostly English conversation from European users, and Twitter, which was newly created by collecting English tweets from selected trending topics in different countries. The pipeline classifies language family, native language and origin (Native or non-Native English speakers) of a given textual input. We evaluate the resulting classifications by comparing prediction accuracy, precision and F1 scores of our classification pipeline to traditional machine learning processes. Lastly, we compare the results from each dataset and find differences in language use for topics and platforms. We obtained high prediction accuracy for all categories on the Twitter dataset and observed high variance in features such as average text length especially for Balto-Slavic countries.
The distributed setting of RDF stores in the cloud poses many challenges. One such challenge is how the data placement on the compute nodes can be optimized to improve the query performance. To address this challenge, several evaluations in the literature have investigated the effects of existing data placement strategies on the query performance. A common drawback in theses evaluations is that it is unclear whether the observed behaviors were caused by the data placement strategies (if different RDF stores were evaluated as a whole) or reflect the behavior in distributed RDF stores (if cloud processing frameworks like Hadoop MapReduce are used for the evaluation). To overcome these limitations, this thesis develops a novel benchmarking methodology for data placement strategies that uses a data-placement-strategy-independent distributed RDF store to analyze the effect of the data placement strategies on query performance.
With this evaluation methodology the frequently used data placement strategies have been evaluated. This evaluation challenged the commonly held belief that data placement strategies that emphasize local computation, such as minimal edge-cut cover, lead to faster query executions. The results indicate that queries with a high workload may be executed faster on hash-based data placement strategies than on, e.g., minimal edge-cut covers. The analysis of the additional measurements indicates that vertical parallelization (i.e., a well-distributed workload) may be more important than horizontal containment (i.e., minimal data transport) for efficient query processing.
Moreover, to find a data placement strategy with a high vertical parallelization, the thesis tests the hypothesis that collocating small connected triple sets on the same compute node while balancing the amount of triples stored on the different compute nodes leads to a high vertical parallelization. Specifically, the thesis proposes two such data placement strategies. The first strategy called overpartitioned minimal edge-cut cover was found in the literature and the second strategy is the newly developed molecule hash cover. The evaluation revealed a balanced query workload and a high horizontal containment, which lead to a high vertical parallelization. As a result these strategies showed a better query performance than the frequently used data placement strategies.