Twitter loaded into DEX: the 4.5 billion graph

We would like to share one of the latest use cases we have worked on with DEX at Sparsity Technologies.

We created a DEX graph by loading all tweets posted from June to December 2009: more than 476M tweets and 1.2 billion follower relations.

The resulting graph has 655M nodes of 4 different types: user, tweet, hashtag and URL. These nodes are connected by relations such as retweet, follows or references, for a total of 2.2 billion edges. Nodes and edges carry a total of 1.6 billion attributes. Counting nodes, edges and attributes together, the Twitter data adds up to a DEX graph database of about 4.5 billion objects.
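
As a quick sanity check on the headline figure, the counts quoted above can simply be added up. This is only a sketch of the arithmetic: the variable names are ours, and the values are the rounded figures from this post, not exact counts read from the database.

```python
# Rounded figures quoted in the post, expressed as individual objects.
nodes = 655_000_000          # users, tweets, hashtags and URLs
edges = 2_200_000_000        # retweet, follows, references, ...
attributes = 1_600_000_000   # attribute values stored on nodes and edges

# The "4.5 billion" in the title is the sum of all three kinds of object.
total_objects = nodes + edges + attributes

print(f"{total_objects:,} objects")
print(f"about {total_objects / 1e9:.1f} billion")
```

The sum is 4,455,000,000, which rounds to the 4.5 billion used in the title.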

[Figure: the resulting graph schema]

The load ran on a Linux machine with 64GB RAM and a single CPU, and the resulting graph database occupies 192GB, which is 3 times the available memory.

The resulting graph is ready to be queried. If you were wondering whether social networks such as Twitter or Facebook could take advantage of graph databases, we are positive this test contributes to the exploration of this interesting scenario.

Questions like “Is it reasonable to work with such a big graph?” or “How long will queries on the graph take?” arise. We’ll return with more information.


10 Responses to Twitter loaded into DEX: the 4.5 billion graph

  1. Nick says:

    Awesome.

    How long did it take to load the graph into the system?

  2. admin says:

    Thank you Nick!

    This was a huge load for that amount of memory and a single CPU. The load took 3 days.

  3. Jesper says:

    Interesting result on an interesting dataset!

    Could you please describe the hardware configuration in greater detail? Which RAID configuration was used, and on which kind of disks? What block size do the nodes have? Is it a quad-core CPU, and at which clock speed? How was the graph generated: in one big sequential write, or in multiple operations (with the extra seek time that takes)?

    I look forward to your response!

    • admin says:

      Jesper, thank you for your interest!

      The machine has 2 Intel Xeon E5440 processors at 2.83GHz, 64GB RAM and 8 disks of 250GB each.

      We generated the graph in multiple operations: inserting entities into the graph, detecting following and follower relationships, detecting relations inside the tweet messages such as RT, @user or hashtags, and other operations.

      If you would like more information, do not hesitate to send us an email at info@sparsity-technologies and we’ll give you more details!
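
The step of detecting relations inside tweet messages (RT, @user, hashtags), mentioned in the reply above, could look roughly like the sketch below. This is not the loader that was used for this test and it does not call the DEX API. The function name `extract_relations`, the regular expressions and the edge labels are all hypothetical, chosen only to illustrate the idea with the Python standard library.

```python
import re

# Hypothetical patterns; real tweet parsing has more edge cases than this.
URL_RE = re.compile(r"https?://\S+")
RETWEET_RE = re.compile(r"\bRT\s+@(\w+)")
MENTION_RE = re.compile(r"@(\w+)")
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_relations(author, text):
    """Return (source, edge_type, target) triples found in one tweet."""
    urls = URL_RE.findall(text)
    # Drop URLs first so that '#fragment' or '@' inside a link is not
    # mistaken for a hashtag or a mention.
    body = URL_RE.sub(" ", text)

    edges = []
    retweeted = RETWEET_RE.findall(body)
    for user in retweeted:
        edges.append((author, "retweet", user))
    for user in MENTION_RE.findall(body):
        if user not in retweeted:  # already recorded as a retweet
            edges.append((author, "references", user))
    for tag in HASHTAG_RE.findall(body):
        edges.append((author, "references", "#" + tag.lower()))
    for url in urls:
        edges.append((author, "references", url))
    return edges


sample = "RT @bob: graph databases rock #DEX http://example.com/x"
for edge in extract_relations("alice", sample):
    print(edge)
```

Each triple would then become one edge insertion in whatever graph store is used; that part is left out here because it depends on the database API.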

  4. Jesper says:

    By the way – is the Twitter dataset used in this test available to the public?

    • admin says:

      I’m afraid our specific dataset is not available. Some academic websites may offer other datasets for download.

  5. Ignacio says:

    Congratulations!
    How did you overcome the Twitter rate limiting?
    In other words, how did you gather that huge amount of data?

  6. admin says:

    Thank you Ignacio! We used an accumulated dataset.

