Microblogging queries on graph databases

At the last edition of the GRADES Workshop, co-located with the SIGMOD/PODS Conference, a group of researchers from RMIT University (Australia) presented the paper “Microblogging Queries on Graph Databases: An Introspection”. In this paper the authors share their experience of executing a wide variety of microblogging queries on two popular graph databases: Neo4j and Sparksee(*). The queries were designed to be relevant to popular applications that use microblogging data, such as data from Twitter, to provide friend recommendations, analyze user influence or find co-occurrences. They were executed on a large real graph data set consisting of nearly 50 million nodes and 326 million edges. In this post we discuss the conclusions the authors drew from executing two of the most relevant advanced queries: recommendation queries and influence queries.

Recommendation queries

User recommendation on microblogging sites like Twitter usually involves looking at 1-step and 2-step followers and/or followees, since you are more likely to know, or share interests with, the local community formed by the friends of your friends (or followers) than with an outsider. Taking this into account, the paper describes the following recommendation query:

  • Q4.1 finds all the 2-step followees of a given user A whom A is not already following. Such followees are recommended to A.
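
As a rough illustration of what Q4.1 computes, the sketch below runs the same logic over a plain in-memory adjacency map in Python. It uses neither Neo4j nor Sparksee: the `follows` dictionary and the function name `recommend_followees` are hypothetical and exist only for this example.

```python
def recommend_followees(follows, user):
    """Return the 2-step followees of `user` that `user` does not already follow.

    `follows` maps each user to the set of users they follow
    (a hypothetical in-memory stand-in for the "follows" edge type).
    """
    direct = follows.get(user, set())
    two_step = set()
    for followee in direct:
        # Expand one more hop from each direct followee.
        two_step |= follows.get(followee, set())
    # Drop the user themselves and anyone already followed.
    return two_step - direct - {user}


follows = {
    "A": {"B", "C"},
    "B": {"C", "D"},
    "C": {"A", "E"},
}
print(sorted(recommend_followees(follows, "A")))  # ['D', 'E']
```

Here C is reachable in two steps but is excluded because A already follows C, and A is excluded because it is the user being served.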

To implement Q4.1, Sparksee offers the neighbours operator, which returns all the neighbouring nodes of a given user across the follows edge, that is, the users at the other end of that user’s follows relationships. This is a good example of the kind of information worth materialising when the database is created: an index is built to access those neighbours, which results in faster retrieval queries. In this case the authors decided to avoid materialisation in order to make the import phase faster. There is always a trade-off between a faster import/creation and better query performance, and it should be weighed for each particular scenario. The results of executing Q4.1 are shown in Figure 1.


Figure 1: recommendation query (Q4.1) execution results.

Finding 2-step followees results in an explosion of nodes when the 1-step followees have a high out-degree, forcing the systems to keep a large portion of the graph in memory. The authors attribute the sudden spike in the Neo4j plot to the fact that the direct degree of the node concerned is much higher, even though the number of rows returned is lower. They also consider it noteworthy that “Neo4j’s performance degrades with a large intermediate result in memory while Sparksee is able to take advantage of the graph already in memory observing less fluctuations with the output”.
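
To see why followees with a high out-degree inflate the intermediate result, the following sketch counts how many rows a 2-hop expansion produces before deduplication, compared with how many distinct users remain. It is a toy in-memory model with hypothetical names (`follows`, `two_step_expansion`) and made-up data, not a description of how either database executes the query.

```python
def two_step_expansion(follows, user):
    """Return (intermediate_rows, distinct_followees) for a 2-hop expansion.

    intermediate_rows counts every (1-step followee, 2-step followee) pair,
    i.e. the sum of the out-degrees of the user's direct followees.
    """
    direct = follows.get(user, set())
    intermediate_rows = sum(len(follows.get(f, set())) for f in direct)
    distinct = set()
    for f in direct:
        distinct |= follows.get(f, set())
    return intermediate_rows, len(distinct)


# Hypothetical data: "hub" follows 1000 accounts, "quiet" follows only 2.
follows = {
    "A": {"hub", "quiet"},
    "hub": {f"u{i}" for i in range(1000)},
    "quiet": {"u1", "u2"},
}
print(two_step_expansion(follows, "A"))  # (1002, 1000)
```

A single hub among the direct followees accounts for almost all of the expanded rows, and the number of distinct results can be smaller than the number of rows that had to be held while computing them.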

When looking at Figure 1, note the difference in Y-axis scale between the two plots, which can be misleading. The plot on the right (Sparksee) shows average times on a scale almost 2 orders of magnitude smaller than the plot on the left (Neo4j). For example, at 750k rows Sparksee’s average time is around 1.7 seconds, while Neo4j’s is around 47 seconds, roughly 28 times longer.

 

Influence queries

Discovering the current or potential influence of a user on his or her community is useful in a wide range of situations, from affiliate marketing strategies to ad targeting. Although there are plenty of models of influence propagation, the authors of the paper take an intuitive approach and define it as:

  • Current influence: the users who most frequently mention someone and who are already followers of that user.
  • Potential influence: the users who most frequently mention a user without being direct followers of that user.
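
The two definitions above can be sketched in plain Python as a count of mentions split by follower status. This is only an illustration under assumed data structures: the `mentions` list, the `followers` map, the `influence` function and its `top` parameter are all hypothetical and are not taken from the paper or from either database’s API.

```python
from collections import Counter


def influence(mentions, followers, user, top=10):
    """Split the users who mention `user` into current and potential influence.

    `mentions` is an iterable of (author, mentioned_user) pairs and
    `followers` maps a user to the set of users following them; both are
    hypothetical in-memory stand-ins for the graph's edges.
    """
    # Count how often each author mentions the target user.
    counts = Counter(author for author, target in mentions if target == user)
    user_followers = followers.get(user, set())
    # Retain followers for current influence, remove them for potential influence.
    current = Counter({a: n for a, n in counts.items() if a in user_followers})
    potential = Counter({a: n for a, n in counts.items() if a not in user_followers})
    return current.most_common(top), potential.most_common(top)


mentions = [("B", "A"), ("B", "A"), ("C", "A"),
            ("D", "A"), ("D", "A"), ("D", "A"), ("B", "C")]
followers = {"A": {"B", "C"}}
current, potential = influence(mentions, followers, "A")
print(current)    # [('B', 2), ('C', 1)]
print(potential)  # [('D', 3)]
```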

In both Neo4j and Sparksee, this translates to finding the users who mentioned A and then retaining (for current influence) or removing (for potential influence) those who already follow A. The performance results of finding influencers are shown in Figure 2:


Figure 2: influence query (Q5.2) execution results.

As in Figure 1, the plot on the right (Sparksee) shows average times on a scale 2 orders of magnitude smaller than the plot on the left (Neo4j), which means that similar plot profiles reflect significant performance differences. For instance, for users with 60K mentions on Twitter, Sparksee identifies the influencers in only 0.3 seconds.

In the paper, the authors also discuss other queries, such as those that use the built-in shortest path algorithms of both databases. Sparksee has improved its performance in the latest version, 5.2.

If you are interested in more details, please check the complete article here. If you are interested in benchmarking Sparksee or using it in your research, do not hesitate to ask us for a free license under our Research program!

 

(*) The experiments in the paper were conducted on a standard Intel Core 2 Duo 3.0 GHz machine with 8 GB of RAM and a non-SSD HDD, using Neo4j v2.2M03 and Sparksee v5.1.
