• search hit 71 of 76
Back to Result List

Study on Data Placement Strategies in Distributed RDF Stores

  • The distributed setting of RDF stores in the cloud poses many challenges. One such challenge is how the data placement on the compute nodes can be optimized to improve the query performance. To address this challenge, several evaluations in the literature have investigated the effects of existing data placement strategies on the query performance. A common drawback in theses evaluations is that it is unclear whether the observed behaviors were caused by the data placement strategies (if different RDF stores were evaluated as a whole) or reflect the behavior in distributed RDF stores (if cloud processing frameworks like Hadoop MapReduce are used for the evaluation). To overcome these limitations, this thesis develops a novel benchmarking methodology for data placement strategies that uses a data-placement-strategy-independent distributed RDF store to analyze the effect of the data placement strategies on query performance. With this evaluation methodology the frequently used data placement strategies have been evaluated. This evaluation challenged the commonly held belief that data placement strategies that emphasize local computation, such as minimal edge-cut cover, lead to faster query executions. The results indicate that queries with a high workload may be executed faster on hash-based data placement strategies than on, e.g., minimal edge-cut covers. The analysis of the additional measurements indicates that vertical parallelization (i.e., a well-distributed workload) may be more important than horizontal containment (i.e., minimal data transport) for efficient query processing. Moreover, to find a data placement strategy with a high vertical parallelization, the thesis tests the hypothesis that collocating small connected triple sets on the same compute node while balancing the amount of triples stored on the different compute nodes leads to a high vertical parallelization. Specifically, the thesis proposes two such data placement strategies. The first strategy called overpartitioned minimal edge-cut cover was found in the literature and the second strategy is the newly developed molecule hash cover. The evaluation revealed a balanced query workload and a high horizontal containment, which lead to a high vertical parallelization. As a result these strategies showed a better query performance than the frequently used data placement strategies.

Download full text files

Export metadata

Additional Services

Share in Twitter Search Google Scholar
Author:Daniel Janke
Referee:Steffen Staab, Georg Lausen
Advisor:Steffen Staab
Document Type:Doctoral Thesis
Date of completion:2020/02/05
Date of publication:2020/02/06
Publishing institution:Universität Koblenz, Universitätsbibliothek
Granting institution:Universität Koblenz, Fachbereich 4
Date of final exam:2020/01/31
Release Date:2020/02/06
Number of pages:xiii, 197
Institutes:Fachbereich 4 / Institute for Web Science and Technologies
BKL-Classification:54 Informatik
Licence (German):License LogoEs gilt das deutsche Urheberrecht: § 53 UrhG