Query Expansion by Exploiting Graph Properties for Image Retrieval

When searching for images that match a query, ambiguity problems may arise. For instance, for the query “colored Volkswagen beetle”, the most common search engines retrieve some images of the famous car, but also images of the insect and even of a bird that goes by that name. Query expansion avoids this problem by giving the query a context and adding related concepts that may enrich the results or increase their precision. Going back to the example, it would probably be more useful for the results to contain images of other cars similar to the Volkswagen Beetle than images of fauna.


Figure 1: Image retrieval with query expansion process diagram

Figure 1 shows the image retrieval process including query expansion. It starts with the original query Qo, the entry point to the Query Expansion module, which searches for several complementary terms and phrases and combines them into an enriched query Qe. Qe is then submitted to the regular search engine. The big question is: which terms should enrich the query, and how do we find them? According to the research paper “Massive Query Expansion by Exploiting Graph Knowledge Bases for Image Retrieval” by Joan Guisado-Gámez, David Dominguez-Sal and Josep L. Larriba-Pey, the answer may be graph techniques.

Their method is based on the topological analysis of the graph built out of a knowledge base. It differs from previous work in that it does not consider the links of each article individually; instead, it mines the global link structure of the knowledge base to find related terms using graph mining techniques. The technique consists of, first, locating the most relevant terms in the knowledge base and connecting them by a path using the knowledge base relations, and second, extracting communities of concepts around the detected path. With this technique they are able to identify (i) concepts that match the user’s need, and (ii) a set of semantically related concepts. Matching concepts provide equivalent reformulations of the query that reduce the vocabulary mismatch. Semantically related concepts introduce a significant set of different terms that are likely to appear in a relevant document.

In its first phase, this Query Expansion technique uses shortest paths to find the most direct way to connect one term to a related one; in its second phase, it uses community search on the graph to enrich the previously computed paths with closely related articles.

For the example query, the system finds 182 shortest paths using the English Wikipedia. Among them, nine score 3/2, which is the top score:

volkswagen→volkswagen beetle
volkswagen fox→volkswagen beetle
volkswagen passat→volkswagen beetle
volkswagen type 2→volkswagen beetle
volkswagen golf→volkswagen beetle
volkswagen jetta→volkswagen beetle
volkswagen touareg→volkswagen beetle
volkswagen golf mk4→volkswagen beetle
volkswagen beetle→volkswagen transporter

The first path in the list is especially relevant because it connects the generic concept Volkswagen to the more specific concept Volkswagen Beetle. The rest of the paths are also valuable because they reflect the real intent of the user and exclude any path that contains articles about other interpretations of the term beetle.
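The path-finding phase can be illustrated with a plain breadth-first search. This is only a minimal sketch, assuming the knowledge base is available as an in-memory map from each article to the articles it links to; the class and method names are hypothetical, and the scoring that ranks the paths (the 3/2 above) is not reproduced because the post does not describe it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Breadth-first shortest path over an unweighted article-link graph (illustrative only). */
final class ShortestPathSketch {

    /** Returns one shortest path from start to goal, or an empty list if none exists. */
    static List<String> shortestPath(Map<String, List<String>> links, String start, String goal) {
        Map<String, String> parent = new HashMap<>();   // article -> article it was reached from
        Deque<String> queue = new ArrayDeque<>();
        parent.put(start, null);
        queue.add(start);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            if (current.equals(goal)) {
                // Walk the parent pointers back to the start, then reverse.
                List<String> path = new ArrayList<>();
                for (String n = goal; n != null; n = parent.get(n)) {
                    path.add(n);
                }
                Collections.reverse(path);
                return path;
            }
            for (String next : links.getOrDefault(current, List.of())) {
                if (!parent.containsKey(next)) {        // not reached yet
                    parent.put(next, current);
                    queue.add(next);
                }
            }
        }
        return List.of();
    }
}
```

Because breadth-first search explores articles in order of distance, the first time the goal is dequeued the recovered path has the fewest possible links.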

For the second part of the technique, the most direct solution would be to enrich the path with all the neighbors of the Wikipedia articles. However, when they tested this naive solution, it did not work because it introduced articles that were only loosely related to the path. Wikipedia articles typically have many links, and many of them refer to topics that are related in some way but semantically very distant. They therefore implemented a community search algorithm to distinguish semantically strong links from weak ones. A community in a graph is a set of closely linked nodes that are similar to each other but different from the nodes in the rest of the graph. In our example, this second part of the query expansion process adds related terms such as German cars, Volkswagen group, vw bug, VW Type 1, Wolfsburg (the city where the cars were manufactured), Baja bugs (original Volkswagen Beetles modified to operate off-road) or Cal Look (the name for customized Volkswagen Beetles that follow a style coined in California in 1969) to enrich the results.

Check the complete article for the experimental results and precision measures!

Read more articles about research using graph databases in our blog. If you think your research will benefit from using a graph database, consider using Sparksee, available for Python, Java, .NET, C++ and Objective-C. Request one of our FREE licenses under our Research Program.

Posted in Research

Graph database use case: Insurance fraud detection

According to a fact sheet released by the Southwest Insurance Information Service (SIIS), approximately 10% of all insurance claims are fraudulent, and the Coalition Against Insurance Fraud estimates that nearly $80 billion in fraudulent claims are made annually in the U.S. Insurance fraud is certainly an issue that must be addressed, given the benefits that both the insurer and the insured obtain from its prevention: the insurance buyer receives coverage at a lower price, which in turn gives the insurance company a competitive advantage.

Insurance fraud can be perpetrated by the seller or the buyer. Seller fraud occurs when the seller of a policy hijacks the usual process in a way that maximizes his or her profit. Some examples are premium diversion, fee churning, ghost companies and workers’ compensation fraud. Buyer fraud occurs when the buyer deliberately invents or exaggerates a loss in order to obtain more coverage or receive payment for damages. Some examples are false medical history, murder for proceeds, post-dated life insurance and faked accidents.

Traditional methods to detect and prevent this form of fraud include duplicate testing, using date validation systems, calculating statistical parameters to identify outliers, using stratification or other types of analysis to identify unusual entries, and identifying gaps in sequential data. These methods are a good way to catch most casual, lone fraudsters, but sophisticated fraud rings are usually organized and informed enough to avoid being spotted by traditional means. They use layered “false” collusions, much as money laundering rings do.

In this scenario, where implementing alternative fraud detection methods is crucial, graph database management systems play a significant role. In the case of buyer fraud, the only way to catch the complex layered collusion performed by criminal rings is to analyze the relationships between the elements involved in the claim, which is a tedious task on a relational database. While an RDBMS has to join a large number of tables (accidents, companies, drivers, lawyers, witnesses…) in complex schemas, a GDBMS only has to traverse the graph following the relationships between the nodes, which is significantly more efficient, especially when querying large datasets. An example of buyer insurance fraud represented as a graph would be as follows:


Example 1: Simple buyer insurance fraud case represented as a graph.

In this example, subjects 5 and 4 participate directly in both accidents. Subject 1 is also related to both accidents indirectly, given that the car he drives is owned by the driver of Car 3, who is involved in Accident 2, which means subjects 1 and 5 must know each other. With the added value of social network analytics (another scenario where graph DBMSs usually outperform relational DBMSs), we have one more clue supporting our suspicion of fraud: subject 3 and subject 6 are friends on Facebook, meaning that 5 out of 6 nodes of the graph are related to both Accident 1 and Accident 2 in some way, which should raise the alarm about possible fraud. In addition, a graph database would make it easy to add data from different sources under a changing schema and, for instance, to move from subject 1 to subject 6 to see that they are in fact also related. To keep the example clear and easy to visualize, we have used a fraud ring that claims only two false accidents, but real cases of large fraud rings usually involve a greater number of claims, where the relationships between the people involved can hardly be explained by coincidence.
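The kind of traversal described here can be sketched in a few lines of plain Java. This is a minimal sketch under stated assumptions, not Sparksee code: the graph is an in-memory undirected adjacency map, the node names and the hop limit are hypothetical, and the search simply collects every node that can be reached from both accidents within a few hops without passing through the other accident.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Finds nodes that are close to two different accidents (illustrative only). */
final class SharedInvolvementSketch {

    /** Adds an undirected relationship between a and b. */
    static void link(Map<String, List<String>> graph, String a, String b) {
        graph.computeIfAbsent(a, k -> new ArrayList<>()).add(b);
        graph.computeIfAbsent(b, k -> new ArrayList<>()).add(a);
    }

    /** Nodes reachable from start within maxHops edges, never entering the blocked node. */
    static Set<String> within(Map<String, List<String>> graph, String start, String blocked, int maxHops) {
        Map<String, Integer> hops = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        hops.put(start, 0);
        queue.add(start);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            int d = hops.get(current);
            if (d == maxHops) continue;                 // do not expand beyond the hop limit
            for (String next : graph.getOrDefault(current, List.of())) {
                if (!next.equals(blocked) && !hops.containsKey(next)) {
                    hops.put(next, d + 1);
                    queue.add(next);
                }
            }
        }
        Set<String> reached = new TreeSet<>(hops.keySet());
        reached.remove(start);
        return reached;
    }

    /** Nodes within maxHops of both accidents: candidates for a closer look. */
    static Set<String> linkedToBoth(Map<String, List<String>> graph, String a, String b, int maxHops) {
        Set<String> shared = within(graph, a, b, maxHops);
        shared.retainAll(within(graph, b, a, maxHops));
        return shared;
    }
}
```

With a hop limit of 1 only the directly involved subjects are returned; raising the limit also surfaces indirect links such as a shared car, which is the kind of relationship that would require several joins in a relational schema.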

A graph database could also be used to search for fraud across different insurance companies, finding similarities in patterns and behaviors that could add value to an analysis like the one shown in the example above.

As we have said, a high performance graph database like Sparksee is a perfect match for dealing with large amounts of data in situations where deep relationship analysis is required. Remember that you can download it for free under an evaluation or research license and use it for your own project. You can find more graph database use cases, scenarios and success stories by searching for the “use case” tag on the blog or visiting the “scenarios” section of our website.



Posted in Sparksee, Use Case

Management of mobile device data

** Note: This is a curated article first published on Quora by Sparsity’s CEO, Mr. Larriba Pey **

The content stored on Mobile Devices (MDs) grows as users’ tastes evolve, application trends change and the needs of each work environment grow. As a result, MD users keep increasing the amount of data and metadata they generate, as well as the number of Apps installed on their devices. Users also keep increasing their interaction with applications like Twitter, Facebook or LinkedIn, which increases the amount of their own data managed by third parties.

Time and practice will show that Mobile Graph Databases (M-GDBs) are the perfect match to manage and query all those datasets, for two reasons: managing a single data repository will provide added-value linked data, and M-GDBs will greatly boost querying capabilities.

Added-value linked data. With M-GDBs, one single data management system will allow all the mobile Apps to access a significant variety of data, turning it into added-value linked data for the user, including friends, topics, metadata for image and video content, own data stored by third-party applications, application usage, GPS location, weather forecasts, etc.

For instance, using the M-GDB to automatically disambiguate phone and e-mail contacts, based on the calls made and the e-mails sent and received, will provide a single source of higher-value social data. Going further, M-GDBs will also allow the MD contacts to be linked automatically with the data available through the social network APIs of Twitter, Facebook, LinkedIn and others.

Once linked, it will be possible to enrich that social data with metadata about the photos taken with the MD camera, GPS information about the current location or weather information provided by third-party public APIs.

Boosting the querying capabilities. In addition to the capabilities of relational DBs, M-GDBs will allow graph-oriented queries that provide added-value features.

Queries like the following will be easy to implement and will provide significant added-value information: Among my closest contacts (friends or friends of a friend, FOAFs), who has tastes similar to mine, so that I can send them the last photo taken with my MD? Is there a friend or a FOAF who lives in the place I am visiting and whom I could call or e-mail? Can I get attractions recommended using the social review information of my friends or FOAFs?
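To make the second of these queries concrete, here is a minimal sketch of a two-hop FOAF lookup in plain Java. It does not use the Sparksee API: the friendship map, the home-city map and every name in it are hypothetical stand-ins for data that an M-GDB would hold on the device.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Friends and friends of friends who live in a given city (illustrative only). */
final class FoafSketch {

    static Set<String> contactsIn(Map<String, List<String>> friends,
                                  Map<String, String> homeCity,
                                  String person, String city) {
        Set<String> circle = new TreeSet<>();
        for (String friend : friends.getOrDefault(person, List.of())) {
            circle.add(friend);                                        // first hop: direct friends
            circle.addAll(friends.getOrDefault(friend, List.of()));    // second hop: FOAFs
        }
        circle.remove(person);                                         // the person is not their own contact
        circle.removeIf(c -> !city.equals(homeCity.get(c)));           // keep only contacts living in the city
        return circle;
    }
}
```

The query is just two neighbor expansions followed by a filter on an attribute, which is why it maps naturally onto a graph traversal.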

Sparksee 5 mobile is the only M-GDB on the market covering a full range of operating systems such as Android, iOS and BB10. Request your download now!

If you are interested in mobile technology for big data and/or in-device analytics, stay tuned to our Twitter next week: we are going to be at the 2015 edition of the international MWC (Mobile World Congress) in Barcelona, sharing the latest developments.

Posted in Sparksee, Sparksee mobile

Graph database use case: business intelligence applications of Indoor Positioning Systems

Estimote iBeacon and iPhone 6. Picture by Jonathan Nalder.


Since Real Time Locating Systems (RTLS) entered the market back in the 1990s, there have been several attempts to create a reliable, secure system to locate nearby objects and people in indoor environments. That is not surprising given that people spend most of their time inside buildings, where satellite-based location systems like GPS suffer from signal attenuation. A large number of technological approaches followed RTLS, most of them based on radio signals. It wasn’t until Bluetooth-enabled devices became more popular that the first beacon-based Indoor Positioning Systems (IPS), like Apple’s iBeacon and its Android-based counterpart Datzing, entered the market, allowing for indoor location, mapping, geofencing and proximity detection.

These technologies make it possible to acquire relevant in-store behaviour data from customers, which could become a key factor in improving the customer’s experience while, for instance, shopping, in developing new marketing strategies and in boosting the efficiency of the spatial organization of buildings and stores. Let’s see a couple of use cases where graph database technology could be key to developing high performance solutions for indoor positioning analysis applications.

Product placement optimization

For both examples we will consider a clothing store with several collections, each placed in a certain spot. Sometimes customers find the collections they like close to each other, but more frequently they do not, which results in a loss of interest and probably in the customer walking away. A beacon-based product placement optimization application could be the solution to this problem. Given the nature of the data, using a graph database like Sparksee could make a difference to the performance of such an application. Consider the example that follows:

When a customer browses a collection (e.g. stands for more than 30 seconds at a collection spot), that collection becomes a node in the graph. Every time a customer goes from one collection to another, we create a weighted edge between them. If the same pattern of behavior is repeated (by the same customer or another one), the relation between these collections is stronger, so the weight of the edge between the two nodes increases. Because we want to decide where to place each of the collections inside the building, we can then look for a path that optimizes the weight between two spots or nodes, navigating through all the nodes in the graph. This is an example of an application for the optimal placement of certain products in a store, which could also be used to predict the location of further products.
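The edge-weighting step can be sketched as follows. This is a minimal illustration in plain Java rather than Sparksee code: each visit is assumed to be the ordered list of collections a customer browsed, the "from->to" key format is a hypothetical encoding of an edge, and the sketch only reports the strongest transition instead of attempting the path optimization.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Builds weighted transition edges from in-store visits (illustrative only). */
final class TransitionGraphSketch {

    /** Edge weights keyed by "from->to": how many times customers moved between two collections. */
    static Map<String, Integer> countTransitions(List<List<String>> visits) {
        Map<String, Integer> weights = new HashMap<>();
        for (List<String> visit : visits) {
            for (int i = 0; i + 1 < visit.size(); i++) {
                String edge = visit.get(i) + "->" + visit.get(i + 1);
                weights.merge(edge, 1, Integer::sum);   // repeated behavior strengthens the edge
            }
        }
        return weights;
    }

    /** The transition with the largest weight, or null if there are none. */
    static String strongest(Map<String, Integer> weights) {
        String best = null;
        for (Map.Entry<String, Integer> e : weights.entrySet()) {
            if (best == null || e.getValue() > weights.get(best)) {
                best = e.getKey();
            }
        }
        return best;
    }
}
```

Collections joined by the heaviest edges are the ones customers most often visit one after the other, so they are natural candidates for being placed close together.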

In-store advertising

Another potential use case of graph databases and beacon-based Indoor Positioning Systems is presenting offers and ads based on prior customer behavior. From a marketing point of view, it is not efficient to advertise the same products to every client, given that they have different tastes and needs. Using the patterns acquired through the process described in the first use case, we could optimize the ads and offers presented to each customer. This would result in a better experience for the customer and in a greater probability of them purchasing the advertised item. These ads could be presented via a smartphone application or on monitors placed on the walls of the store. Having a mobile graph database like Sparksee would allow the application to be updated in real time, based on the customer’s current movements in the shop and on his and similar customers’ previous behaviour, showing the customer ads that could draw his attention to a particular part of the shop.

You can find more Sparksee use cases, tutorials and other useful resources on the Sparsity Technologies website, our blog or Sparsity’s social media channels. Also remember that you can download Sparksee for free and start using it for your own project.

Stay tuned and follow us for more graph database use case inspiration!

Posted in Sparksee, Sparksee mobile, Use Case

Sparksee’s seminar at BarcelonaTech

Sparsity will teach an introductory course on Sparksee for students at BarcelonaTech. The course is part of the Seminari d’empresa 2015 initiative, which aims to be a hub between IT companies and university students so they can learn about the latest industry advances.

Sparksee’s course will be divided into three parts of about 3 hours each, one per day:

– Part 1: Introduction to graph databases and to Sparksee, and why we claim high performance for large volumes of data. The first part of the seminar will include some interesting graph database use cases like root cause analysis or enterprise staff analysis.

– Part 2: Hands-on tutorial that will cover the basics of Sparksee and the first graph operations. Students will learn how to create their first Sparksee graph database, add some data and work with its basic low-level operations.

– Part 3: Second part of the hands-on tutorial, where the students will tackle advanced queries such as PageRank or finding communities, which will reveal the strengths of graph databases and how to take advantage of their characteristics to create higher performance solutions.

Sparsity is glad to be part of this BarcelonaTech initiative again for its 2015 edition, to make graph databases better known among university students.

Posted in Events, Sparksee

How & when to use the recovery functionality

In this new edition of Sparksee’s how-to series, we would like to highlight the recovery functionality, which will keep your database safe at all times and is especially recommended for first-time users.

Sparksee includes an automatic recovery manager which keeps the database safe against any eventuality. In case of application or system failures, the recovery manager is able to bring the database to a consistent state on the next restart.

By default the recovery manager is disabled, but we recommend enabling it, especially for new Sparksee users, before starting to construct your first graph database. The recovery can be set up through SparkseeConfig, which should be your first line of code when creating your database(*):

 SparkseeConfig cfg = new SparkseeConfig();
 Sparksee sparksee = new Sparksee(cfg);

The recovery has the following variables to set:

  • sparksee.io.recovery: Set it to true to enable the recovery.
  • sparksee.io.recovery.logfile: Set the name & path of the recovery log file, otherwise it will be stored in the same path as your database. Remember that the extension for this file is .log
  • sparksee.io.recovery.checkpointTime: Set the interval, in microseconds, at which the recovery will copy the committed transactions to the recovery log. By default it’s 60 seconds (60000000).
  • sparksee.io.recovery.cachesize: Set the size of your recovery cache. We don’t recommend changing the default option.

Here is an example of a typical configuration for the recovery functionality:

SparkseeConfig cfg = new SparkseeConfig();
cfg.setRecoveryEnabled(true); // enable the recovery
cfg.setRecoveryLogFile("recoverylogfile.log"); // relative path: stored in the execution directory, same as your database
cfg.setRecoveryCheckpointTime(90000000); // checkpoint every 1.5 minutes (value in microseconds)
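Since the checkpoint time is expressed in microseconds, it is easy to be off by a factor of a thousand. A small helper can do the conversion; this is only a sketch using the standard library, and the class and method names are hypothetical rather than part of Sparksee.

```java
import java.util.concurrent.TimeUnit;

/** Converts a human-friendly interval to the microsecond value used for the checkpoint time. */
final class CheckpointTimeSketch {

    static long checkpointMicros(long seconds) {
        return TimeUnit.SECONDS.toMicros(seconds);   // 1 second = 1,000,000 microseconds
    }
}
```

For the configuration above, `checkpointMicros(90)` yields 90000000, and the 60-second default corresponds to 60000000.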

And why isn’t the recovery enabled in the first place? The recovery introduces a small performance penalty that strongly depends on the checkpoint time. We therefore let users, who know the characteristics of their application and its typical update patterns, decide which compromise to make in order to achieve the highest possible performance while keeping the database as safe as possible. A user who is aware of this functionality will be able to get the most out of it even with the default parameters.

Don’t forget to tell us if you are using the recovery and how; your feedback is key to making Sparksee grow!

(*) Examples are shown in Java, please refer to your language of choice in the User Manual chapters Configuration & Maintenance and Monitoring.

Posted in Documentation, Sparksee

Graph Database Use Case: Fraud detection

Fraud and financial crimes are a form of theft or larceny that occur when a person or entity takes money or property for their own use, or uses them in an illicit manner for their personal benefit. These crimes typically involve some form of deceit, subterfuge or the abuse of a position of trust, which distinguishes them from common theft or robbery.

For most countries, one of the financial crimes that is hardest to prevent, detect and prosecute is money laundering. Money laundering is the process by which the proceeds of crime are transformed into apparently legitimate money or other assets. These kinds of processes usually follow specific transaction patterns that can be simplified as follows (see Figure 1):

1) Collecting the money coming from illegal activities.
2) Placing it into a depository institution.
3) Adding a layer to the transaction (such as a payment of a false invoice or a loan to another company).
4) Integrating the money into the financial system by purchasing financial/industrial investments, luxury assets etc.


Figure 1 – Diagrammatic description of a money laundering scheme by ExplicitImplicity under CC-BY-SA-3 and GFDL.

All the information regarding these transactions is registered by the banks and financial entities that take part in the process, and it can be represented as a graph, with each entity involved (person, company, organization…) as a node and each transaction as an edge of the network. A fraud detection application would then compare the known transaction patterns of previously prosecuted fraud cases with the patterns in our network to find out whether they have points in common. Figure 2 is an example of a graph representing a money laundering fraud.


Figure 2 – Money laundering graph example.

In this case, Subject X transfers the illicit proceeds to the associated Company Y (placement), which pays a false invoice from Company Z. Company Z makes a loan to Company Y for the same amount as the false invoice, adding a layer to the process and making the fraud more difficult to spot. More layers can be added at this point, for instance by purchasing chips at a casino and cashing them in again. Then, Company Y invests in a legitimate financial institution to integrate the money into the financial system, and finally it withdraws the capital, transferring the earnings back to Subject X, who receives the “clean” money. As you can see from Figure 2, a graph representation of the information helps us identify more easily the loop that makes Subject X a suspect in a possibly fraudulent transaction.
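Spotting such a loop is a cycle search on the directed transaction graph. The following is a minimal sketch in plain Java under stated assumptions: transfers are held in an in-memory map from each entity to the entities it sent money to, the class and method names are hypothetical, and a depth-first search stands in for whatever pattern matching a real fraud detection application would use.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Looks for a chain of transfers that leaves an entity and returns to it (illustrative only). */
final class MoneyLoopSketch {

    /** Returns a transfer path from origin back to origin, or an empty list if none exists. */
    static List<String> findLoop(Map<String, List<String>> transfers, String origin) {
        List<String> path = new ArrayList<>();
        path.add(origin);
        if (search(transfers, origin, origin, new HashSet<>(), path)) {
            return path;
        }
        return List.of();
    }

    private static boolean search(Map<String, List<String>> transfers, String origin, String current,
                                  Set<String> visited, List<String> path) {
        for (String next : transfers.getOrDefault(current, List.of())) {
            if (next.equals(origin)) {          // the money came back to where it started
                path.add(origin);
                return true;
            }
            if (visited.add(next)) {            // expand each entity only once
                path.add(next);
                if (search(transfers, origin, next, visited, path)) {
                    return true;
                }
                path.remove(path.size() - 1);   // dead end: backtrack
            }
        }
        return false;
    }
}
```

A loop on its own is not proof of fraud, but it flags the entities whose transactions deserve to be compared against known laundering patterns.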

Although all the connections necessarily happen at a specific point in time (e.g. Company Y cannot transfer the “clean” money to Subject X before making all the other transactions), note that we don’t need this information to compare one pattern with another.

Other similar use cases involving graph databases for fraud detection include tax evasion and illegal funding, where the key also lies in searching for known irregular patterns in the transaction graph.

If you want to know more about graph database use cases, scenarios and success stories, you can search for the “use case” tag on the blog or visit the “scenarios” section of our website. Remember that you can download Sparksee 5.1 for free and use it for your project!

Posted in Sparksee, Use Case

Recap of the year and future outlook for 2015


As the end of 2014 approaches, we believe it’s a good time to look back and take stock of everything we have been working on and everything that has happened to Sparksee this year.

One of the most important highlights of 2014 has been the release of Sparksee 5.1. Key features like the new Objective-C API, enhanced compatibility with Blueprints, the cache that dynamically adapts its size, compatibility with Visual Studio 2013 and the rollback functionality have meant a great step forward for our high performance graph database.

During the year, Sparsity has also started a Tetracom Technology Transfer Project and joined the European Network of Excellence on High Performance and Embedded Architecture and Compilation (HiPEAC) in order to further improve academia-industry interaction. We have also stayed involved in the Linked Data Benchmark Council (LDBC) and the Coherent PaaS European Projects.

2014 has also been a year full of interesting events related to the graph and NoSQL world. Sparsity attended GraphLab in San Francisco, NoSQL Matters in Barcelona, the LDBC TUC meeting in Athens, the ICT Proposers’ Day in Florence, SIGMOD and GRADES in Salt Lake City, and BizBarcelona and the MWC in Barcelona. We hope we were able to meet at one of these events! If we haven’t met yet, you will definitely find us at graph database events in 2015.

Last but not least, we have been able to share and interact with the whole Sparksee community through Twitter, Facebook, Google+ and our blog. During this year we have published 17 posts including tutorials, use cases, news & events, Sparksee technical details and research articles. Don’t hesitate to check them out using our archive on the sidebar.

A lot of positive things have already happened this year, but there’s a lot more to come in 2015. Sparsity will keep moving forward thanks to your feedback and contributions, to take our high performance solutions to the next level.

The Sparsity Technologies team wishes you the best for the holiday season and a happy new year 2015!

Posted in News, Sparksee

SNA: How to predict the most viral users with Sparksee

Social Network Analysis (SNA) is one of those use cases that everyone mentions when talking about the strengths of graph databases. It’s no secret that a network of people interacting with each other instantly brings the image of a graph to mind. Once you have constructed the social graph, it opens up plenty of possibilities to explore it wisely in order to effectively answer questions like the one we are going to deal with in today’s post: how to discover who is most likely to make my message go viral in the network.

To give more insight into how to construct a good algorithm that finds the most viral users and exploits the capabilities of the graph, we draw on the literature and refer to the “greedy algorithm”. For those not yet familiar with this algorithm, let us introduce its definition.

Starting from a solution tree, the greedy algorithm builds a solution that aims to maximise a defined function f(n). At each iteration, the algorithm looks at the child nodes of the current node, selects the one that maximises f(n) and moves forward.

Figure 1.0 shows an example of an execution of the greedy algorithm. Blue nodes are the ones already included in our solution, yellow nodes are the ones being evaluated with our function f(n) (we also call these nodes candidates) and white nodes are those never visited and thus not evaluated. It’s vitally important to establish the best heuristic for f(n) so that the algorithm delivers the best possible solution. The example in Figure 1.0 shows how important it is to tune the algorithm: we are looking for three consecutive nodes that maximise the sum of their values. A simple greedy algorithm returns the blue nodes (5, 7 and 5, for a sum of 17), while a better solution in our example would be nodes 5, 3 and 50 (for a sum of 58).


Figure 1.0 – Example of a Greedy Algorithm 
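The gap between the greedy answer and the best answer can be reproduced with a small tree. The sketch below is plain Java with hypothetical names; the tree is an assumed reconstruction that only matches the values quoted in the text (5, 7, 5 on one branch and 5, 3, 50 on the other), and the sibling with value 2 is invented to give the greedy walk a choice at the second level.

```java
import java.util.ArrayList;
import java.util.List;

/** Compares a greedy walk with an exhaustive search on a small tree (illustrative only). */
final class GreedyTreeSketch {

    /** A tree node holding the value that f(n) returns for it. */
    static final class Node {
        final int value;
        final List<Node> children = new ArrayList<>();

        Node(int value, Node... kids) {
            this.value = value;
            for (Node kid : kids) {
                children.add(kid);
            }
        }
    }

    /** Greedy walk: from the root, always step to the child with the largest value. */
    static int greedySum(Node root, int length) {
        int sum = root.value;
        Node current = root;
        for (int step = 1; step < length && !current.children.isEmpty(); step++) {
            Node best = current.children.get(0);
            for (Node child : current.children) {
                if (child.value > best.value) {
                    best = child;
                }
            }
            sum += best.value;
            current = best;
        }
        return sum;
    }

    /** Exhaustive search: the best sum over every downward path of at most length nodes. */
    static int bestSum(Node node, int length) {
        if (length == 1 || node.children.isEmpty()) {
            return node.value;
        }
        int best = Integer.MIN_VALUE;
        for (Node child : node.children) {
            best = Math.max(best, bestSum(child, length - 1));
        }
        return node.value + best;
    }
}
```

On the tree `new Node(5, new Node(7, new Node(5), new Node(2)), new Node(3, new Node(50)))`, the greedy walk picks 7 over 3 at the first step and never gets to see the 50, which is exactly why the choice of f(n) matters so much.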

Let’s see, then, which ideas you could use to construct a good function for a greedy algorithm that discovers the most viral users in our social network. For each node (user) you can compute a weight, so that the greedy algorithm moves through the nodes that maximise that propagation weight. The measure of propagation should take into account things like the user’s previous propagations, which could also be valued against the propagations of other users, or the number of documents ever propagated by that user. Another important option to consider is restricting the measure to previous propagations on a similar theme.

With all those ideas you should be able to tune your own unique propagation function, which could then be used in an algorithm such as the following:

 Require: a graph G and a node N
 Ensure: I is the set of nodes infected by N
 I ← empty set;
 P ← queue of pending nodes, initially containing N;
 V ← empty set of visited nodes;
 while P is not empty do
     x ← P.dequeue();
     edges ← edges(x, source);
     for edge ∈ edges do
         tail ← edge.tail();
         if V does not contain tail then
             V ← union(V, tail);
             P ← union(P, tail);
             if Math.random() < edge.weight() then
                 if tail ∉ I then
                     I ← union(I, tail);
                 end if
             end if
         end if
     end for
 end while
 return I;
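For readers who prefer running code, the procedure above can be sketched in plain Java. This is an illustration rather than Sparksee code: the graph is an in-memory map of outgoing edges, all names are hypothetical, and the sketch assumes that an edge weight is the probability (between 0 and 1) that a message propagates across that edge, so a node is infected when a random draw falls below the weight.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

/** Simulates one propagation of a message from a seed user (illustrative only). */
final class PropagationSketch {

    /** A directed edge to target, weighted with an assumed propagation probability in [0, 1]. */
    static final class Edge {
        final String target;
        final double weight;

        Edge(String target, double weight) {
            this.target = target;
            this.weight = weight;
        }
    }

    /** Returns the set of nodes infected in one simulated propagation starting at seed. */
    static Set<String> infectedBy(Map<String, List<Edge>> graph, String seed, Random random) {
        Set<String> infected = new HashSet<>();
        Set<String> visited = new HashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.add(seed);
        while (!pending.isEmpty()) {
            String current = pending.remove();
            for (Edge edge : graph.getOrDefault(current, List.of())) {
                if (visited.add(edge.target)) {                  // first time this node is reached
                    pending.add(edge.target);
                    if (random.nextDouble() < edge.weight) {     // propagate with probability weight
                        infected.add(edge.target);
                    }
                }
            }
        }
        return infected;
    }
}
```

Because each run is random, a user's virality would in practice be estimated by repeating the simulation many times per seed and averaging the size of the infected set; passing a seeded `Random` keeps individual runs reproducible.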

We hope you find our success story of using Sparksee with this greedy algorithm to discover the most viral users inspiring for your Social Network Analysis projects.

Download Sparksee now for free and start building your social graph to search for propagation using the algorithms explained here!

Posted in SNA, Sparksee

Learning high-performance graph database management with Sparksee at the NoSQL matters Training Session

On Friday 21st of November from 9h to 13h Sparsity will host a Training Session as part of the NoSQL matters events.

Skilled trainers from Sparksee will explain to the attendees how to take advantage of the graph, covering the most common queries that are best suited to being answered with a graph. The training will use a Twitter model and dataset to build the graph, and will then cover queries on the resulting graph, such as discovering how two Twitter users are connected.

Attendees will be given a NetBeans project with Sparksee Java and a complete set of fill-in-the-blanks exercises. They will also receive a free development license to build graphs of up to 1B objects with unlimited sessions for 6 months.

Looking forward to meeting you at the NoSQL matters Training Session!

Posted in Events, News, Sparksee