DEX Analytical Use Case Benchmark: Wikipedia

As announced in the previous post, we would like to share an analytical use case that demonstrates DEX's high performance. This time we look at how DEX responds to a set of queries over a single dataset, using the results of another well-known open-source graph database as a reference.

With this benchmark we would also like to join the celebration of Wikipedia's 10th anniversary. Wikipedia was launched in 2001 by Jimmy Wales and Larry Sanger and has become the largest and most popular general reference work on the Internet, with 365 million readers. Our congratulations to everyone making Wikipedia possible!

For the benchmark we used all the Wikipedia articles written before January 2010. In particular, the loaded database contained 55M articles, 2.1M images and 321M references between articles.

With this benchmark we want to obtain the following information:

  1. Loading times, including the generation of full index structures for the graph.
  2. Graph database size.
  3. Response times for typical queries made to the loaded data, which include:
    1. Query 1 (Q1): Finds the node with the maximum outdegree (the one with the most relationships to other nodes), and then runs a BFS traversal of the graph starting from that node. More information about traversal algorithms can be found in the graph algorithms post.
    2. Query 2 (Q2): Finds the node with the maximum indegree, selects the nodes referencing that node, and with this new set, again finds the nodes referencing every node in the set; in other words, it performs a 2-hop operation. Finally, the query ranks the nodes by number of references and returns the top 5.
    3. Query 3 (Q3): Finds a pattern in the graph. The pattern looks for articles written in Catalan (CA) whose English (EN) translations are missing some of the images from the original article.
    4. Query 4(Q4): Finds the number of articles and images for every language available.
    5. Query 5(Q5): Materializes the number of images for all the articles.
    6. Query 6(Q6): Deletes all the articles from the database with no images.
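To make the query logic above concrete, here is a minimal sketch of Q1 and Q2 in Python over a hypothetical toy adjacency list (this is an illustration of the operations, not the DEX API; the node names and graph are made up):

```python
from collections import deque

# Toy directed graph as adjacency lists: node -> list of referenced nodes.
# (Hypothetical data; the real benchmark runs over the Wikipedia graph.)
graph = {
    "A": ["B", "C", "D"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def max_outdegree_node(g):
    """Q1, step 1: the node with the most outgoing references."""
    return max(g, key=lambda n: len(g[n]))

def bfs(g, start):
    """Q1, step 2: breadth-first traversal from the start node."""
    visited, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in g.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return order

def top_referencers_2hop(g, k=5):
    """Q2: from the max-indegree node, collect the nodes referencing it,
    then the nodes referencing that set (2 hops), ranked by references."""
    indegree = {}
    for src, dsts in g.items():
        for dst in dsts:
            indegree[dst] = indegree.get(dst, 0) + 1
    target = max(indegree, key=indegree.get)
    hop1 = {src for src, dsts in g.items() if target in dsts}
    hop2 = {src for src, dsts in g.items() if hop1 & set(dsts)}
    return sorted(hop2, key=lambda n: len(g[n]), reverse=True)[:k]

print(max_outdegree_node(graph))   # "A" (outdegree 3)
print(bfs(graph, "A"))             # ["A", "B", "C", "D"]
print(top_referencers_2hop(graph))
```

On the real dataset these set operations run over hundreds of millions of references, which is where the index structures built at load time matter.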

See the results in the following table:

It is remarkable that loading all the Wikipedia articles into DEX took only 2.25 hours, with a resulting database size of 16.98 GB. This shows the huge amounts of information that can be loaded into DEX in reasonable time, with no disk restrictions. The results for all six queries favor DEX, with speedups of more than two orders of magnitude for every query except Q3, where DEX is still one order of magnitude faster.

DEX delivers the best results in both loading time and query response time, making it an attractive option for solutions with big volumes of data that are cumbersome to analyze. Try DEX now and see this performance in action.

This entry was posted in DEX. Bookmark the permalink.

5 Responses to DEX Analytical Use Case Benchmark: Wikipedia

  1. Pingback: Tweets that mention DEX Analytical Use Case Benchmark: Wikipedia | --

  2. Pingback: A Survey on Graph Databases « Elements of Study on Information Technologies

  3. Could you please publish the code for the benchmark (both DEX and neo4j)?

    Otherwise the results are extremely doubtful as nobody else can run it and see how it has been run.

    • admin says:

      Hi Dmytrii,

      I can give you more details about the benchmark to see how it has been run.

      The experiments were performed on a computer with two quad-core Intel(R) Xeon(R) E5440 processors at 2.83 GHz. The memory hierarchy consists of a 6144 KB second-level cache, 64 GB of main memory and a 1.7 TB disk. The operating system is Debian GNU/Linux 4.0 (etch).

      DEX uses a buffer pool for the out-of-core management of its graphs. The maximum buffer pool size was set to 60 GB, divided into 64 KB pages.

      For all the experiments, each query was executed five times, and the slowest and fastest results were discarded. The reported time is the average of the remaining three results. The DEX query engine is restarted before each execution to guarantee that its internal buffer pool is empty.
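      In other words, the reported time is a trimmed mean. A minimal sketch, with made-up timings, of how each reported number is derived from the five runs:

```python
def reported_time(runs):
    """Average after discarding the single fastest and slowest of five runs."""
    assert len(runs) == 5
    trimmed = sorted(runs)[1:-1]  # drop the min and the max
    return sum(trimmed) / len(trimmed)

# Hypothetical timings (seconds) for one query's five executions.
print(reported_time([1.20, 0.95, 1.10, 3.40, 1.00]))  # averages 1.00, 1.10, 1.20
```

      Dropping the extremes reduces the influence of warm-up effects and OS-level noise on the reported figures.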

      Neo4j has its own disk-based native storage manager and an edge-traversal framework for query resolution. We used Neo4j v1.5 with 45 GB reserved for the JVM. In addition, three Lucene v3.1.0 indexes were created for the attributes ARTICLE ID, ARTICLE NLC and IMAGE ID to speed up the queries.

      I have contacted the DAMA-UPC research group, which performed the benchmark, but they cannot give more details while the paper is under submission to a conference.

  4. Pingback: Graph Databases: I am developing a website which needs graph database ,but I didn't find fully open source graph database using php. Is there any working opensource graph databases? - Quora