I am trying to find named entities in a given text. For that, I have tried using the DBpedia Spotlight service.

  1. I am able to get a response from it. However, the DBpedia dataset is limited, so I tried replacing their spotter.dict file with my own dictionary. My dictionary contains one entity per line:

    Sachin Tendulkar###PERSON

    Barack Obama###PERSON

    .... etc

  2. Then I parse this file and build an ExactDictionaryChunker object.

  3. Now I am able to get the entities and their types (after modifying the DBpedia Spotlight code).

My question is: DBpedia Spotlight uses Lucene index files. I don't understand what purpose these files serve.

Can we do this without the index files? What is the significance of the index files?
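
For illustration, the dictionary-lookup steps above can be sketched in a few lines of Python. This is a simplified stand-in, not LingPipe's `ExactDictionaryChunker` and not DBpedia Spotlight code; the function names `parse_dictionary` and `find_entities` are hypothetical, and the sketch assumes the `name###TYPE` line format shown above, tolerating stray whitespace around the separator.

```python
import re

def parse_dictionary(lines, sep="###"):
    """Parse 'surface form###TYPE' lines into a {name: type} dict."""
    entries = {}
    for line in lines:
        line = line.strip()
        if not line or sep not in line:
            continue  # skip blank or malformed lines
        name, _, entity_type = line.partition(sep)
        entries[name.strip()] = entity_type.strip()
    return entries

def find_entities(text, entries):
    """Return (start, end, surface form, type) for each exact, case-sensitive match."""
    if not entries:
        return []
    # Try longer names first so a long name wins over a shorter prefix of it.
    names = sorted(entries, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")
    return [(m.start(), m.end(), m.group(0), entries[m.group(0)])
            for m in pattern.finditer(text)]

# Hypothetical sample input in the same format as the custom dictionary.
sample_lines = ["Sachin Tendulkar###PERSON", "Barack Obama ###PERSON"]
entries = parse_dictionary(sample_lines)
matches = find_entities("Barack Obama met Sachin Tendulkar in Mumbai.", entries)
for start, end, name, entity_type in matches:
    print(start, end, name, entity_type)
```

A lookup like this only finds where a known name occurs. It has no way to decide between several candidate entities that share the same surface form.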

Sreedhar GS
  • Looks like there is some explanation of how Lucene is used in their [Github wiki](https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Lucene---Architecture) – femtoRgon Feb 21 '14 at 17:04
  • Thanks for your response. But that page does not discuss the Lucene index at all. It's too abstract. – Sreedhar GS Feb 25 '14 at 08:18

1 Answer

Lucene was used in the earlier implementation of DBpedia Spotlight to store a model of each entity in our knowledge base (KB). This model gives us a relatedness measure between the context (extracted from your input text) and the entity. More concretely, each entity is represented by a vector of weighted terms {t1: score1, t2: score2, ... }. At runtime we model your input text as a vector in the same dimensions and measure the cosine similarity between the input vector and the entity vectors. In your case, you would have to add a vector for Sachin Tendulkar to the space (that is, add a document to the Lucene index) if it is not already there. The latest implementation, though, has moved away from Lucene to an in-house in-memory context store: https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Internationalization-(DB-backed-core)
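
To make the vector comparison concrete, here is a minimal Python sketch of ranking entities by cosine similarity against the input context. It does not use Lucene and is not Spotlight's actual code; the entity identifiers, the term weights in `entity_vectors`, and the helper names are invented for illustration, and the context vector uses plain term counts rather than whatever weighting Spotlight applies.

```python
import math
from collections import Counter

def to_vector(text):
    """Model a text as a sparse term-count vector (whitespace tokens, lowercased)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse vectors given as {term: weight} mappings."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical entity models; a real system would learn these weights from
# text that mentions each entity. The numbers here are made up.
entity_vectors = {
    "Sachin_Tendulkar": {"cricket": 0.9, "batsman": 0.8, "india": 0.5},
    "Barack_Obama": {"president": 0.9, "senator": 0.6, "washington": 0.5},
}

def rank_entities(context_text, entity_vectors):
    """Score every entity against the context and return them best-first."""
    context = to_vector(context_text)
    scores = {entity: cosine(context, vec) for entity, vec in entity_vectors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranking = rank_entities("the batsman scored a century for india in cricket",
                        entity_vectors)
for entity, score in ranking:
    print(entity, round(score, 3))
```

Adding a new entity in this sketch means adding one more entry to `entity_vectors`, which plays the role that adding a document to the Lucene index plays in the older implementation.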

Pablo Mendes