Understanding Graph Structure of Wikipedia for Query Expansion

Query Expansion









Knowledge bases are very good sources for knowledge extraction, the ability to create knowledge from structured and unstructured sources and use it to improve automatic processes as query expansion. Wikipedia, in particular, could be analyzed to see how articles and categories relate to each other and how these relationships can support a query expansion technique. In particular, the authors of this article show that the structures in the form of dense cycles with a minimum amount of categories tend to identify the most relevant information.

Let’s see a little overview on this approach. Read the complete article by Joan Guisado and Arnau Prat, presented during last GRADES 2015 in Melbourne.

Query expansion is the process of expanding a query issued by a user, introducing new terms, called expansion features, in order to improve the quality of the retrieved results. Query expansion is motivated by the assumption that the query introduced by the user is not the best to express its real intention. For example, vocabulary mismatch between queries and documents is one of the main causes of a poor precision in information retrieval systems. Poor results also arise from the topic inexperience of the users. The challenge is to properly select the best expansion features.

Wikipedia has been proven to be a good source for query expansion, but the innovation in this paper lies in the fact of considering the differences between a social network and a knowledge base by:

  • Creating a ground truth consisting of those articles in Wikipedia that provide good results for each of the queries that are the baseline in the experiments.
  • Analysing how the articles and categories of the ground truth are structured within the Wikipedia graph.
  • Identifying cycles of articles and categories as an important structure and also trends within them. 30% of the dense cycles with minimum ratio of categories, are tagged as the best expansion features.
  • Identifying challenging and open problems for graph processing technologies when it comes to exploit structures of large graphs such as Wikipedia

Query Expansion

A quick analysis of the query graphs reveals that they are, in general, disconnected graphs composed by a moderately large connected component. This is an interesting observation as it means that, in general, the terms users introduce in a search engine are semantically related either directly or by means of extra articles or categories. This suggests that Wikipedia, contains this semantic relation encoded within its structure, and therefore, can be exploited. Also, we observe that the largest connected component is clearly dominated by categories.

Read about the results in the experiments here.

If you are interested in query expansion, make sure to check our first post with the basis of this on-going research.

Also, if you are using graphs for your research don’t hesitate to request being part of our Research program where we grant free licenses of Sparksee.

Posted in Research | Tagged , , , , , | Comments Off on Understanding Graph Structure of Wikipedia for Query Expansion

Sparsity involved in SOMATCH EU project: bringing BI to SMEs of the fashion industry

We are glad to announce that Sparsity will take part of the SOMATCH EU project under the Horizon 2020 programme. The main objective of the project is the improvement of the competitiveness of European SMEs dedicated to fashion design and Textile & Clothing (T&C) sectors by providing an IT solution that will hand over to creative designers detailed and reliable fashion trends estimations and forecasts of user acceptance for clothing designs. This will be achieved with the creation of an innovative tool for the mining and visualization of large sets of unstructured data, regarding the use and preferences of fashion products by consumers, supporting T&C companies quick reaction to the market dynamics and better adaptation of design to real consumers’ demand.

SOMATCH faces its complex and challenging deal by the combined development and application of SoA advanced image analysis technology- unexploited and innovative in clothing and fashion- combined with social network analysis.


The visualisation of the generated data will be performed from off-line statistics, generated after data processing, and by new real-time instruments for image collection and designs evaluation. They will be targeted also by the integration of the systems with new SoA mobile devices to collect information and to visualise trend interpretation. This approach will open a vast field of new approaches for the fashion designers, supporting end users involvement into the whole trend evaluation and a close interaction with them.

Sparsity’s role in the project is to provide the Social Networks Analysis to detect influential trendsetters powered by Sparksee high-performance graph database management system. Other project partners include research centres focused on image and content analysis (Technical University of Munich, UPC-Barcelonatech), software providers with expertise in the fashion industry and platform development (iDeal, Holonix), end users from SME textile industry and retail (Dena Cashmere) and fashion-related social networking and e-commerce sites owners (Weblogs, Not Just a Label).

Posted in Research, Sparksee | Tagged , , , , , , | Comments Off on Sparsity involved in SOMATCH EU project: bringing BI to SMEs of the fashion industry

Sparsity is the SME with the highest innovation capacity in Europe according to the European Commission

Sparsity Technologies is the SME with the highest innovation capacity in Europe according to European Commission’s first Innovation Radar (IR) Report published last week by the Joint Research Centre. The IR report is a support initiative that focuses on the identification of high-potential innovations in the ICT FP7, CIP and H2020 EU funded projects and the key organisation in delivering these innovations to the market. The report documents the details of the IR methodology and the results of its first application.

Between May 2014 and January 2015, the Commission reviewed 279 ICT projects, equivalent to the 10.6% of ICT projects funded by the EC, which had resulted in a total of 517 innovations, delivered by 544 organisations in 291 European cities.

Sparsity has been identified as a key organisation in two innovations:  the first one, providing a new method to partition any given graph into its community structure of strongly connected components, within the Linked Data Benchmark Council (LDBC) project whose goal was to create the first comprehensive suite of open, fair and vendor-neutral benchmarks for RDF/graph databases. The second, the development of the common query engine for the different “one size” data stores optimized for particular tasks, data, and workloads, within the CoherentPaaS project whose objective is to provide a rich PaaS (Platform as a service).

Top 10 SMEs and their innovations

Barcelona, where Sparsity has its headquarters, has been identified as the largest innovation hub in ICT in Europe, hosting a total of 19 organisations that are key players in innovation and beating London and Paris which each host 17 such organisations. According to the report, Germany, Spain and the UK are the countries with the most organisations that are key players in delivering innovations.

Sparsity wants to thank all the hard work and dedication of the talented Sparsity team that during all these years has made possible this great achievement!

Posted in News, Research | Tagged , , , , , | Comments Off on Sparsity is the SME with the highest innovation capacity in Europe according to the European Commission

Microblogging queries on graph databases

In the last edition of the GRADES Workshop co-located with SIGMOD/PODS Conference, a group of researchers from the RMIT University (Australia) presented the paper “Microblogging Queries on Graph Databases: An Introspection”. In this paper the authors shared their experience on executing a wide variety of micro-blogging queries on two popular graph databases: Neo4j and Sparksee(*). Their queries were designed to be relevant to popular applications making use of microblogging data such as from Twitter providing friend recommendations, analyzing user influence or finding co-occurrences. The queries were executed on a large real graph data set consisting of nearly 50 million nodes and 326 million edges. In this post we are going to discuss about the conclusions drawn by the researchers of the paper from the execution of 2 of the most relevant advanced queries: recommendation queries and influence queries.

Recommendation queries

User recommendation on microblogging sites like Twitter usually involves looking at 1-step and 2-step followers and/or followees, given that it is more probable to know or share interests with the local community that the friends of your friends (or followers) create rather than with an outsider. Taking this into account the paper describes the following recommendation query:

  • Q4.1 finds all the 2-step followees of a given user A, who A is not following. Such followees are recommended to A.

To implement Q4.1 Sparksee offers the neighbours operator which will return all the followers of a certain user hence all the neighbouring nodes for the edge follows for the given user. This would be a good example of the type of information to materialise at the creation phase of your database so you’ll have an index created to access to them and will result in faster retrieval queries. In this case the authors decided to avoid materialisation during the import phase to make it faster. It is always a trade-off between having a faster import/creation or better query performance that should be considered regarding each particular scenario. The result of executing Q4.1 are shown in Figure 1.


Figure 1: recommendation query (Q4.1) execution results.

Finding 2-step followees results in an explosion of nodes when 1-step followees have high out-degree forcing the systems to keep a large portion of the graph in memory. The authors explain the sudden spike in the plot for Neo4j with the fact that, the direct degree of the node in concern is much higher even though the number of rows returned are lower and they think noteworthy to mention that “Neo4j’s performance degrades with a large intermediate result in memory while Sparksee is able to take advantage of the graph already in memory observing less fluctuations with the output”.

When looking at Figure 1, one should also note the scale differences on the Y axis between the two plots, which can be misleading. The plot on the right (Sparksee) shows the average time in almost 2 orders of magnitude less than the plot on the left (Neo4j). So for example at 750k rows Sparksee’s average time is around 1.7 seconds, while Neo4j’s is around 47 seconds.


Influence queries

Trying to discover which is the current or potential influence of a user on his or her community is useful in a wide range of situations from affiliate marketing strategies to ad targeting. Although there are plenty of models of influence propagation the authors in the paper take an intuitive road defining it as:

  • Current influence: the most frequent users who mention someone and who are already followers of that user.
  • Potential influence: people who are most mentioning an user without being direct followers of that user.

Both in Neo4j and Sparksee this translates to finding the users who mentioned A, and removing (or retaining) the users who are already following A. The performance results of finding influencers are shown in Figure 2:


Figure 2: influence query (Q5.2) execution results.

Like in Figure 1, the plot on the right (Sparksee) shows the average time in 2 orders of magnitude less than the plot on the left (Neo4j), which means that similar plot profiles reflect significant performance differences. For instance users with 60K mentions on twitter are identified using Sparksee in only 0.3 seconds.

In the paper the authors also discuss other queries like the built-in shortest path algorithms of both databases. Sparksee has improved its performance in the latest version 5.2.

In case you are interested in more details, please check the complete article here. If you are interested in benchmarking Sparksee or using it to leverage your Research do not hesitate to ask us for a free license under our Research program!


(*) Experiments from the paper were conducted on a standard Intel Core 2 Duo 3.0 Ghz and 8GB RAM with a non-SSD HDD. Neo4j v2.2M03 & Sparksee v5.1

Posted in Research, Sparksee | Comments Off on Microblogging queries on graph databases

Sparksee 5.2 new release!

Sparksee 5.2 new features

We are proud to announce the release of Sparksee 5.2 today! Download your trial on our website or request a license for free.

New version includes the following features:

  • The fastest community detection algorithm: We have included the SCD algorithm which has been proven by the state-of-the-art to be the fastest & more precise community algorithm for unweighted graphs. It is now available as part of our graph algorithm package. Don’t forget to check our documentation to start using it!
  • Concurrency performance boost in read transactions: Changes in our internal structures allowing more concurrency at read level & reorganized locks to decrease its performance penalty.
  • Faster shortest path algorithm: For the BFS Shortest Path we have upgraded the mechanics of the algorithm to explore less nodes while improving its performance.
  • Added support for 64 bits processors on Android: We now support the 64bit architecture for Android development. Sparksee is the only available graph database for Android, take this advantage to leverage your mobile apps!

As always bug fixing and further improvement of the documentation has been implemented as well with the newest version, we would like to thank the feedback from our users!

Download Sparksee 5.2

Posted in News, Sparksee | Tagged , , , , , , | Comments Off on Sparksee 5.2 new release!

Query Expansion by Exploiting Graph Properties for Image Retrieval

When searching images for a certain query, ambiguity problems may arise. For instance, when searching for the query “colored Volkswagen beetle” the most common search engines would retrieve some images with the famous car, but also the bug and even a bird that also recalls for that name. Query expansion would avoid this problem by using a context to the query and adding related concepts that may additionally enrich the results or increase its precision. For example and going back to the former example, it would be maybe more useful that the results would contain images of other cars similar to the Volkswagen beetle rather than the fauna images.

Query Expansion - Sparsity Technologies

Figure 1: Image retrieval with query expansion process diagram

In Figure 1 the process for image retrieval including Query expansion is shown. It starts with the original query Qo which will be the entry point for the Query Expansion module which will then search for several complimenting terms and phrases that will be combined to deliver a Qe enriched query. Qe will be introduced in the regular search engine. The big question here is which are those terms to enrich the query and how we do find them? Accordingly to the research “Massive Query Expansion by Exploiting Graph Knowledge Bases for Image Retrieval” by authors Joan Guisado-Gámez, David Dominguez-Sal and Josep L. Larriba-Pey the answer may be graph techniques.

Their method is based on the topological analysis of the graph built out of a knowledge base. They differ from previous works because they are not considering the links of each article individually, instead they are mining the global link structure of the knowledge base to find related terms using graph mining techniques. The technique consist in, first, locating the most relevant terms in the knowledge base, and connecting them by a path using the knowledge base relations, and second, extracting communities of concepts around the detected path. With this technique they are able to identify: (i) concepts that match with the user need, and (ii) a set of semantically related concepts. Matching concepts provide equivalent reformulations of the query that reduce the vocabulary mismatch. Semantically related concepts introduce a significant set of different terms that are likely to appear in a relevant document.

While in the first phase of this innovative Query Expansion technique it uses the shortest path to find the most direct way to connect a particular word to another related, on the second they are using community search on a graph to enrich the previously computed paths with articles that are closely related.

For the example query, the system finds 182 shortest paths using the English Wikipedia. Among them, nine score 3/2 that is the top score:

volkswagen→volkswagen beetle
volkswagen fox→volkswagen beetle
volkswagen passat→volkswagen beetle
volkswagen type 2→volkswagen beetle
volkswagen golf→volkswagen beetle
volkswagen jetta →volkswagen beetle
volkswagen touareg→volkswagen beetle
volkswagen golf mk4 →volkswagen beetle
volkswagen beetle→volkswagen transporter
The first path in the list is specially relevant because it connects the generic concept Volkswagen to the most specific context Volkswagen Beetle. The rest of the paths are also valuable because they refer to the real intent of the user and discard any path that contains articles about other interpretations of the term beetle.

For the second part of the technique, the most direct solution would be to enrich the path with all the neighbors of the Wikipedia articles. However, when they tested this naive solution, it did not work because it introduced articles that were loosely related to the path. Wikipedia articles have typically many links, and many of them refer to topics that have some type of relation but semantically are very distant. They implemented a community search algorithm to distinguish the semantically strong and weak links. A community in a graph is a set of closely linked nodes which are similar among them but are different from other nodes in the rest of the graph. On our example, on this second part of the query expansion process related terms such as German cars, Volkswagen group, vw bug, VW Type 1, Wolfsburg (which is the city where they were manufactured), Baja bugs (which refers to an original Volkswagen Beetle modified to operate off-road) or Cal Look (name used to refer to customized version of Volkswagen Beetle cars that follow a style coined in California in 1969) are added to enrich the results.

Check the complete article for the experimental results and precision measures!

Read more articles about research using graph databases in our blog. If you think your research will benefit from using a graph database consider using Sparksee, available for Phyton, Java, .Net, C++ and Objective-C. Request one of our FREE licenses under our Research Program.

Posted in Research | Tagged , , , , , , , | Comments Off on Query Expansion by Exploiting Graph Properties for Image Retrieval

Graph database use case: Insurance fraud detection

According to a fact sheet released by the Southwest Insurance Information Service (SIIS), Approximately 10% of all insurance claims are fraudulent, and nearly $80 billion in fraudulent claims are spent annually in the U.S., estimates the Coalition Against Insurance Fraud. Insurance fraud is certainly an issue that must be addressed given the benefits both the insurer and the insured will obtain from its prevention: the insurance buyer is able to receive coverage at a lower price, which gives the insurance company a competitive advantage.

Insurance fraud can be perpetrated by the seller or the buyer. Seller fraud occurs when the seller of a policy hijacks the usual process, in a way that maximizes his or her profit. Some examples are premium diversion, fee churning, ghost companies and worker’s compensation fraud. Buyer fraud occurs when the buyer deliberately invents or exaggerates a loss in order to obtain more coverage or receive payment for damages. Some examples are false medical history, murder for proceeds, post-dated life insurance and faking accidents.

Traditional methods to detect and prevent this form of fraud include duplicate testing, using date validation systems, calculating statistical parameters to identify outliers, using stratification or other types of analysis to identify unusual entries, and identifying gaps on sequential data. These methods are a great way to catch most of the casual, single fraudsters, but sophisticated fraud rings are usually well-organized and informed enough to avoid being spotted by the traditional means. They use layered “false” collusions in a similar way than money laundry rings.

In this scenario, where implementing alternative fraud detection methods is crucial, graph database management systems play a significant role. In the case of buyer fraud, the only way to catch the complex layered collusion performed by criminal rings is to analyze the relationships of the elements involved in the claim, which is a tedious task to perform on a relational database. While a RDBMS has to join a large number of tables –accidents, companies, drivers, lawyers, witnesses…- in complex schemes, a GDBMS only has to traverse the graph considering the relationships between the nodes, which is significantly more efficient, especially in cases that require querying large datasets. An example of buyer insurance fraud represented as a graph would be as follows:

graph post insurance 2

Example 1: Simple buyer insurance fraud case represented as a graph.

On this example, subjects 5 and 4 participate on both accidents in a direct way. Subject 1 is also related to both accidents in an indirect way, given that the car he drives is owned by the driver of Car 3, who is involved in Accident 2, which means Subject 1 and 5 must know each other. With the added value of social networks analytics— another scenario where graph DBMS usually outperform relational DBMS–we have one more clue pointing to our suspecting of fraud: Subject 3 and subject 6 are friends on Facebook, meaning that 5 out of 6 nodes of the graph are related to both accident 1 and accident 2 in some way, which should ring a bell to a possible fraud involved. In addition a graph database would make easy to add data from different sources in a changing schema and for instance move from subject 1 to 6 to see that they are in fact also related. To make it clear and easy to visualize we have used an example of a fraud ring that claims only two false accidents, but real cases of large fraud rings usually result in greater number of claims, where the relationships between the people involved can be hardly explained by coincidence.

A graph database could also be used to search for fraud across different insurance companies to find similarities in patterns and behaviors that could add value to an analysis like the one showed on the example above.

As we have said, a high performance graph database like Sparksee is a perfect match to deal with large amounts of data in situations where a deep relationships analysis is required. Remember that you can download it for free under evaluation or research license and use it for your own project. You can find more graph database use cases, scenarios and success stories searching for the “use case” tag on the blog or visiting the “scenarios” section of our website.



Posted in Sparksee, Use Case | Tagged , , , , | Comments Off on Graph database use case: Insurance fraud detection

Management of mobile device data

** Note: This is a curated article published first at Quora by Sparsity’s CEO Mr. Larriba Pey **

The content stored in Mobile Devices (MD) grows as the users evolve in their tastes, the trends in applications change and the needs for each work environment grow. This way, the users of MDs keep increasing the amount of data and metadata generated as well as the Apps installed in their device. Also, the users keep growing their interaction with applications like Twitter, Facebook or LinkedIn, increasing the amount of own data managed by third parties.

Time and practice will show that Mobile Graph Databases (M-GDBs) will be the perfect match to manage and query all those datasets for two reasons: the management of a single data repository will provide added-value linked data and the querying capabilities will be rocketed with M-GDBs.

Added value linked data. With M-GDBs, one single data management system will allow all the mobile Apps accessing a significant variety of data, turning this into added value linked data for the user including: friends, topics, metadata for image and video content, own data stored by third party applications, applications’ usage, GPS localisation, weather forecasts, etc.

For instance, using the M-GDB to automatically disambiguate the phone and e-mail contacts using the calls performed and the mails sent and received will provide a single source of increased-value Social Data. Going further, M-GDBs will also allow automatically linking the MD contacts with the data that can be obtained through the Social Network APIs from Twitter, Facebook or LinkedIn and others.

Once linked, it will be possible to enrich those Social data with metadata about the photos taken with the MD camera, the GPS information about current location or the weather information provided by third party public APIs.

Rocketing the querying capabilities. In addition to the capabilities of Relational DBs, M-GDBs will further allow graph oriented queries providing added value features.

Queries like the following will be easy to implement providing significant added value information: Among my closest contacts (friends or FOAFs), who have similar tastes than I so that I can send them the last photo taken with my MD? Is there a friend or a FOAF who lives in the place I am visiting and I could call or send a mail? Can I have attractions recommended using my friends or FOAFs social review information?

Sparksee 5 mobile is the only M-GDB in the market covering a full range of Operating Systems like Android, iOS and BB10. Request your download now!

If you are interested in mobile technology regarding big data and/or in-device analytics stay tuned for our Twitter next week, we are going to be at the 2015 edition of the international MWC (Mobile World Congress) that will be held at Barcelona sharing the latest developments.

Posted in Sparksee, Sparksee mobile | Tagged , , , , , | Comments Off on Management of mobile device data

Graph database use case: business intelligence applications of Indoor Positioning Systems

Estimote iBeacon and iPhone 6. Picture by Jonathan Nalder.

Estimote iBeacon and iPhone 6. Picture by Jonathan Nalder.

Since the Real Time Locating Systems (RTLS) entered the market back in the 1990s, there has been several attempts to create a reliable secure system to locate objects & people nearby in indoor environments. That is not surprising given that people spend most of the time inside buildings, where space-based satellite location systems like GPS suffer from signal attenuation. There has been a large number of technology approaches after RTLS, most of them based on radio waves and radio signals. It wasn’t until Bluetooth enabled devices became more popular, that the first beacon-based Indoor Positioning Systems (IPS) like Apple’s iBeacon and its android-based homologous Datzing entered the market, allowing for indoor location, mapping, geofencing and proximity detection.

These technologies bring the possibility to acquire relevant in-store behaviour data from customers, which could become a key factor to improve the customer’s experience while, for instance, shopping; developing new marketing strategies and boost the efficiency of the spatial organization of buildings and stores. Let’s see a couple of use cases where graph database technology could be key to develop high performance solutions for indoor positioning analysis applications.

Product placement optimization

For both examples we will consider a clothing store with several collections, with each collection placed in a certain spot. Sometimes customers may find the collections they like close to each other, but more frequently they not, resulting in a loss of interest and probably with one customer walking away. A beacon-based product placement optimization application could be the solution to this problem. Given the nature of the data, using a graph database like Sparksee could make a difference on the performance of such an application. Consider the example that follows:

When a customer is browsing a collection (e.g. stands more than 30 seconds in a collection spot) it becomes a node in the graph. Every time a customer goes from one collection to another we can create a weighted edge between them. If the same pattern of behavior is repeated (by the same user or another user) the relation between these collections is stronger, increasing that weight between the two nodes. We can then discover a path that optimizes the weight between two spots or nodes, navigating through all the nodes included in the graph, because we wish to place each of the collections inside the building. This is an example of an application for the optimal placement of certain products in a store, which could be used also to predict the location of further products.

In-store advertising

Another potential use case of graph databases and beacon-based Indoor Positioning Systems is presenting offers and ads based on prior customer behavior. From a marketing point of view, it is not efficient to advertise the same products to every client, given that they have different tastes and needs. Using the patterns that we acquired through the process described on the first use case, we could optimize the ads and offers that are presented to each customer. This would result in a better experience for the customer and in a greater probability of them purchasing the item announced. This ads could be presented via smartphone application or also through monitors placed on the walls in the store. Having a mobile graph database like Sparksee would allow the application to be updated based on the customer’s current movements in the shop and his and similar costumer’s previous behaviours on real time showing the customer ads that could trigger his attention to a particular part of the shop.

You can find more Sparksee use cases, tutorials and other useful resources in the Sparsity Technologies website, our blog or Sparsity’s social media channels. Also remember that you can download Sparksee for free and start using it for your own project.

Stay tuned and follow us for more graph databases use cases inspiration!

Posted in Sparksee, Sparksee mobile, Use Case | Tagged , , , , , , | Comments Off on Graph database use case: business intelligence applications of Indoor Positioning Systems

Sparksee’s seminar at BarcelonaTech

Sparsity will teach an introductory course to Sparksee for students at BarcelonaTech. The course is part of the Seminari d’empresa 2015 initiative that pretends to be a hub between IT companies and the university students so they can learn about the latest industry advances.

Sparksee’s course will be divided in three days of about 3 hours each part:

– Part 1: Introduction to Graph Databases & to Sparksee and why we claim the high performance for large volume of data. The first part of the seminar will include some interesting graph database use cases like root cause analysis or enterprise staff analysis.

– Part 2 : Hands-on tutorial that will cover the basics of Sparksee and the first graph operations. Students will learn how to create their first Sparksee graph database, add some data and work with its first low level operations.

– Part 3: Second part of the hands-on tutorial, where the students will face advanced queries such as page rank or finding communities, which will make visible the strengths of graph databases and how to take advantage of their characteristics to create higher performance solutions.

Sparsity is glad to be part of this BarcelonaTech initiative again for this 2015 edition to make graph databases more known among the University students.

Posted in Events, Sparksee | Tagged , , , , , , | Comments Off on Sparksee’s seminar at BarcelonaTech