Understanding Graph Structure of Wikipedia for Query Expansion

Knowledge bases are valuable sources for knowledge extraction: the ability to create knowledge from structured and unstructured sources and use it to improve automatic processes such as query expansion.

Wikipedia, in particular, can be analyzed to see how articles and categories relate to each other and how these relationships can support a query expansion technique. The authors of this article show that structures in the form of dense cycles with a minimum number of categories tend to identify the most relevant information.

Query expansion is the process of expanding a query issued by a user, introducing new terms, called expansion features, in order to improve the quality of the retrieved results.
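
As a rough illustration of the definition above, the following Python sketch appends expansion features to the terms of a query. The function name expand_query, the max_new_terms parameter and the feature table are hypothetical and chosen only for this example; they are not taken from the paper.

```python
def expand_query(query, expansion_features, max_new_terms=3):
    """Return the query terms followed by up to max_new_terms expansion features."""
    terms = query.lower().split()
    seen = set(terms)
    added = []
    for term in terms:
        for feature in expansion_features.get(term, []):
            # Skip duplicates and stop adding once the budget is used up.
            if feature not in seen and len(added) < max_new_terms:
                seen.add(feature)
                added.append(feature)
    return terms + added


# Hypothetical expansion features, e.g. titles of related Wikipedia articles.
features = {
    "jaguar": ["panthera onca", "big cat"],
    "habitat": ["ecosystem"],
}
expanded = expand_query("Jaguar habitat", features)
print(expanded)
```

In a real system the feature table would come from a knowledge base rather than being written by hand, and selecting which features to keep is the hard part.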

Query expansion is motivated by the assumption that the query introduced by the user is not the best expression of their real intention. For example, vocabulary mismatch between queries and documents is one of the main causes of poor precision in information retrieval systems. Poor results also arise from users' inexperience with the topic. The challenge is to select the best expansion features.

Wikipedia has been proven to be a good source for query expansion, but the innovation in this paper lies in considering the differences between a social network and a knowledge base by:

Creating a ground truth consisting of those articles in Wikipedia that provide good results for each of the queries that are the baseline in the experiments.
Analysing how the articles and categories of the ground truth are structured within the Wikipedia graph.
Identifying cycles of articles and categories as an important structure, as well as trends within them. 30% of the dense cycles with a minimum ratio of categories are tagged as the best expansion features.
Identifying challenging and open problems for graph processing technologies when it comes to exploiting structures of large graphs such as Wikipedia.
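
To make the idea of cycles of articles and categories more concrete, here is a minimal Python sketch on a toy graph. It enumerates simple cycles, then ranks them by density and by ratio of categories. Everything in it is an assumption made for illustration: the toy edges, the "Category:" prefix used to tell categories from articles, the undirected view of the graph, and the definition of density as the fraction of possible edges present among the nodes of a cycle. The paper may define and compute these measures differently.

```python
from itertools import combinations

# Toy, hand-made edges (assumption): article links and category memberships.
EDGES = [
    ("Jaguar", "Panthera"),
    ("Panthera", "Felidae"),
    ("Jaguar", "Felidae"),
    ("Jaguar", "Category:Big cats"),
    ("Felidae", "Category:Big cats"),
]


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def find_cycles(adj, max_len=5):
    """Enumerate simple cycles with 3..max_len nodes, each reported once."""
    cycles = set()

    def dfs(start, node, path):
        for nxt in adj[node]:
            if nxt == start and len(path) >= 3:
                # Canonical form: smallest node first, fixed orientation.
                if path[1] < path[-1]:
                    cycles.add(tuple(path))
                else:
                    cycles.add((path[0],) + tuple(reversed(path[1:])))
            elif nxt > start and nxt not in path and len(path) < max_len:
                dfs(start, nxt, path + [nxt])

    for start in sorted(adj):
        dfs(start, start, [start])
    return sorted(cycles)


def category_ratio(cycle):
    """Fraction of the nodes in the cycle that are categories."""
    return sum(node.startswith("Category:") for node in cycle) / len(cycle)


def density(cycle, adj):
    """Edges present among the cycle's nodes divided by the maximum possible."""
    present = sum(1 for u, v in combinations(cycle, 2) if v in adj[u])
    return present / (len(cycle) * (len(cycle) - 1) / 2)


adj = build_adjacency(EDGES)
cycles = find_cycles(adj)
# Densest cycles first; ties are broken by the lowest ratio of categories.
ranked = sorted(cycles, key=lambda c: (-density(c, adj), category_ratio(c)))
for cycle in ranked:
    print(cycle, round(density(cycle, adj), 2), round(category_ratio(cycle), 2))
```

Exhaustive cycle enumeration like this only works on very small graphs, which is one reason exploiting such structures on a graph the size of Wikipedia is a challenge for graph processing technologies.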

A quick analysis of the query graphs reveals that they are, in general, disconnected graphs dominated by one moderately large connected component. This is an interesting observation, as it means that, in general, the terms users introduce in a search engine are semantically related, either directly or by means of extra articles or categories. This suggests that Wikipedia contains this semantic relation encoded within its structure and that it can therefore be exploited. We also observe that the largest connected component is clearly dominated by categories.
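
The kind of analysis described above can be sketched in a few lines of Python: split a query graph into connected components, take the largest one, and measure how much of it is made up of categories. The toy graph and the "Category:" naming convention are assumptions made for this example, not data from the paper.

```python
from collections import deque

# Toy query graph (assumption): undirected edges plus one isolated article.
EDGES = [
    ("Jaguar", "Category:Big cats"),
    ("Felidae", "Category:Big cats"),
    ("Felidae", "Category:Mammals"),
    ("Category:Big cats", "Category:Carnivorans"),
    ("Category:Mammals", "Category:Carnivorans"),
]
ISOLATED = ["Habitat"]


def build_adjacency(edges, isolated=()):
    """Build an undirected adjacency map, including nodes with no edges."""
    adj = {node: set() for node in isolated}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def connected_components(adj):
    """Return the connected components of the graph using breadth-first search."""
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return components


adj = build_adjacency(EDGES, ISOLATED)
components = connected_components(adj)
largest = max(components, key=len)
# Share of category nodes in the largest connected component.
category_share = sum(n.startswith("Category:") for n in largest) / len(largest)
print(len(components), len(largest), category_share)
```

In this toy graph the largest component has five nodes, three of which are categories, which mirrors on a small scale the observation that categories dominate the largest component.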

If you are using graphs for your research, don’t hesitate to apply to our Research program, where we grant free licenses of Sparksee.
