The distributed setting of RDF stores in the cloud poses many challenges. One such challenge is how the data placement on the compute nodes can be optimized to improve query performance. To address this challenge, several evaluations in the literature have investigated the effects of existing data placement strategies on query performance. A common drawback of these evaluations is that it is unclear whether the observed behaviors were caused by the data placement strategies (if different RDF stores were evaluated as a whole) or whether they reflect the behavior of distributed RDF stores in general (if cloud processing frameworks like Hadoop MapReduce were used for the evaluation). To overcome these limitations, this thesis develops a novel benchmarking methodology for data placement strategies that uses a data-placement-strategy-independent distributed RDF store to analyze the effect of the data placement strategies on query performance.
With this evaluation methodology, the frequently used data placement strategies were evaluated. This evaluation challenged the commonly held belief that data placement strategies that emphasize local computation, such as minimal edge-cut covers, lead to faster query executions. The results indicate that queries with a high workload may be executed faster with hash-based data placement strategies than with, e.g., minimal edge-cut covers. The analysis of the additional measurements indicates that vertical parallelization (i.e., a well-distributed workload) may be more important for efficient query processing than horizontal containment (i.e., minimal data transport).
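To make the contrast concrete, the following is a minimal sketch in Java (with hypothetical names, not the implementation used in the thesis) of the idea behind hash-based data placement: the node storing a triple is computed from a hash of one of its components, so no global partitioning computation is needed and the data spreads roughly evenly across the cluster.

    /** Sketch of hash-based data placement: each triple is assigned to a
     *  compute node purely by a hash of its subject. */
    public class HashPlacement {

        /** A minimal RDF triple; the representation is illustrative only. */
        record Triple(String subject, String predicate, String object) {}

        private final int numberOfNodes;

        HashPlacement(int numberOfNodes) {
            this.numberOfNodes = numberOfNodes;
        }

        /** Index of the compute node that stores the given triple. All triples
         *  sharing a subject end up on the same node, while different subjects
         *  spread evenly across the cluster. */
        int nodeOf(Triple t) {
            // Math.floorMod avoids negative indices for negative hash codes.
            return Math.floorMod(t.subject().hashCode(), numberOfNodes);
        }

        public static void main(String[] args) {
            HashPlacement placement = new HashPlacement(4);
            Triple t = new Triple("ex:alice", "foaf:knows", "ex:bob");
            System.out.println("stored on node " + placement.nodeOf(t));
        }
    }

Because such a placement largely ignores the graph structure, intermediate query results tend to be spread over many nodes, which distributes the workload (high vertical parallelization) at the price of more data transport (low horizontal containment).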
Moreover, to find a data placement strategy with a high vertical parallelization, the thesis tests the hypothesis that collocating small connected triple sets on the same compute node, while balancing the number of triples stored on the different compute nodes, leads to a high vertical parallelization. Specifically, the thesis examines two such data placement strategies: the overpartitioned minimal edge-cut cover, which was found in the literature, and the newly developed molecule hash cover. The evaluation revealed a balanced query workload and a high horizontal containment, which led to a high vertical parallelization. As a result, these strategies showed a better query performance than the frequently used data placement strategies.
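The intuition behind the second strategy can be sketched as follows (a strong simplification in Java with hypothetical names; molecule construction and balancing in the thesis are more elaborate): connected triples are grouped into a molecule, and whole molecules, rather than single triples, are hashed onto the compute nodes, so connected data stays collocated while the many small molecules keep the node sizes balanced.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    /** Sketch of a molecule hash cover: triples are grouped into small
     *  connected sets ("molecules") around a center vertex, and each molecule
     *  is assigned as a whole to a compute node by hashing its center. */
    public class MoleculeHashCover {

        record Triple(String subject, String predicate, String object) {}

        /** Groups triples into molecules; here a molecule is simply the star
         *  around a subject. The molecules in the thesis may be constructed
         *  differently (e.g., span several hops around the center). */
        static Map<String, List<Triple>> buildMolecules(List<Triple> triples) {
            Map<String, List<Triple>> molecules = new HashMap<>();
            for (Triple t : triples) {
                molecules.computeIfAbsent(t.subject(), k -> new ArrayList<>()).add(t);
            }
            return molecules;
        }

        /** Whole molecules are placed by hash. Because the molecules are small
         *  and numerous, hashing also keeps the triple counts of the nodes
         *  roughly balanced. */
        static int nodeOf(String center, int numberOfNodes) {
            return Math.floorMod(center.hashCode(), numberOfNodes);
        }
    }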
In a software reengineering task, legacy systems are adapted to new requirements with computer support. This requires an efficient representation of all relevant data and information. TGraphs are a suitable representation because all vertices and edges are typed and may carry attributes. Furthermore, there is a global sequence of all graph elements, and for each vertex there is a sequence of all its incidences. In this thesis, the "Extractor Description Language" (EDL) was developed. It can be used to generate an extractor from a syntax description that is extended with semantic actions. The generated extractor creates a TGraph representation of the input data. In contrast to classical parser generators, EDL supports ambiguous grammars, modularization, symbol table stacks, and island grammars. These features simplify the creation of the syntax description. The requirements collected for EDL are used to select an existing parser generator that is suitable for realizing them.
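To illustrate the TGraph properties mentioned above, here is a toy model in Java (deliberately simplified; this is not the JGraLab API): every element carries a type and attributes, all elements appear in one global sequence, and each vertex keeps an ordered sequence of its incidences.

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    /** Toy model of a TGraph: typed, attributed vertices and edges, a global
     *  element sequence, and per-vertex incidence sequences. */
    public class TGraphSketch {

        static class Element {
            final String type;                               // every element is typed
            final Map<String, Object> attributes = new LinkedHashMap<>();
            Element(String type) { this.type = type; }
        }

        static class Vertex extends Element {
            final List<Edge> incidences = new ArrayList<>(); // ordered incidence sequence
            Vertex(String type) { super(type); }
        }

        static class Edge extends Element {
            final Vertex alpha, omega;                       // start and end vertex
            Edge(String type, Vertex alpha, Vertex omega) {
                super(type);
                this.alpha = alpha;
                this.omega = omega;
                alpha.incidences.add(this);                  // the edge appears in both
                omega.incidences.add(this);                  // incidence sequences
            }
        }

        final List<Element> globalSequence = new ArrayList<>(); // global element order

        Vertex createVertex(String type) {
            Vertex v = new Vertex(type);
            globalSequence.add(v);
            return v;
        }

        Edge createEdge(String type, Vertex from, Vertex to) {
            Edge e = new Edge(type, from, to);
            globalSequence.add(e);
            return e;
        }
    }

An extractor generated by EDL would populate such a graph while processing its input.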
After that, the syntax and semantics of EDL are described and implemented on top of the selected parser generator. Subsequently, two extractors, one for XML and one for Java, are created with the help of EDL. Finally, the time they need to process some input data is measured.
TGraphBrowser
(2010)
This thesis describes the implementation of a web server that enables a browser to display graphs created with the Java Graph Laboratory (JGraLab). The user has the choice between a tabular view and a graphical presentation. In both views it is possible to navigate through the graph. Since graphs with thousands of elements may be confusing for the user, he or she is given the option to filter the displayed vertices and edges by their types. Furthermore, the number of graph elements shown can be limited by means of a GreQL query or by directly entering their IDs.
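As an illustration of the type filter, the following Java sketch (hypothetical names; not the TGraphBrowser code) keeps only the vertices whose types the user selected; the final comment shows how a GreQL-style query could express a similar restriction.

    import java.util.List;
    import java.util.Set;
    import java.util.stream.Collectors;

    /** Sketch of the browser's type filter: only elements whose type the user
     *  selected are kept for display. The element class is illustrative. */
    public class TypeFilter {

        record Vertex(int id, String type) {}

        /** Keeps only vertices whose type is in the user's selection. */
        static List<Vertex> filterByType(List<Vertex> vertices, Set<String> selectedTypes) {
            return vertices.stream()
                    .filter(v -> selectedTypes.contains(v.type()))
                    .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            List<Vertex> all = List.of(new Vertex(1, "ClassDefinition"),
                                       new Vertex(2, "MethodDeclaration"));
            System.out.println(filterByType(all, Set.of("ClassDefinition")));
            // A GreQL-style query for the same restriction might look like:
            //   from v : V{ClassDefinition} report v end
        }
    }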