One of the most important challenges for graph databases is how to express graph queries and how to solve them efficiently. There is an important gap between the current industrial approach with libraries of very efficient APIs for procedural languages as Java, and high level languages like Gremlin and Cypher that combine pipes or data flows of results from the execution of graph APIs. Also, it is very difficult to optimize complex graph oriented programs based on direct calls to low level APIs. In recent years several proposes have raised from the research community, but these solutions are still far from being adopted by the graph database vendors.

In the (ACM publishing pendant) paper presented by Sparsity Technologies at GRADES workshop 2014, we propose an algebra with a set of operations to solve graph queries in Sparksee, and we introduce extensions to SQL-like features to compute recursive programs and collection-oriented procedures typical for graph algorithms. The operations of this algebra are capable to reproduce the behavior of the current Sparksee APIs and, at the same time, are flexible enough to be combinable in the form of query plans that are suitable for optimization using the most usual database query optimization techniques as well as future optimizations more specific of graph patterns and graph traversals. Also, by adapting the operations to the graph data representation of the Sparksee engine, the query runtime will be able to take advantage of compressed bitmap processing and combination.

An important property of the operations is that most of them belong or have been adapted from the relational algebra, and the data structures in the form of key-value pairs are close to the relational model with multivalued attributes (Non First Normal Form relations). In particular, our main operation is the semijoin, which has been widely used in database technology for distributed systems, and semijoin program optimization has been studied and formalized in detail. Thus, our proposal is a compromise between the relational model and the noSQL and, in more detail, key-value pairs storages and graph querying systems. On the other hand, our procedural extensions for parallel subquery execution and recursion can be used to simulate graph analytical frameworks for the computation of complex graph algorithms over huge graphs.

Our next steps are to implement the prototype of the query engine with full support of the algebra, and to test and validate the performance of the proposal. After that, we will focus on query plan optimization for semijoin programs combined to our non-relational extensions for the resolution of complex queries over large graphs.

Which are your thoughts on our approach of using the semijoin as main operation? Also, remember that we have a free license program for researchers that would like to use Sparksee.