
I am looking for a solution to extract the list of concepts that a text (or html) document is about. I'd like the concepts to be wikidata topics (or freebase or DBpedia).

For example "Bad is a song by Mikael Jackson" should return Michael Jackson (the artist, wikidata Q2831) and Bad (the song, wikidata Q275422). As this example shows, the system should be robust to spelling mistakes (Mikael) and ambiguity (Bad).

Ideally the system should work across multiple languages and on both short and long texts, and when it is unsure it should return multiple topics (e.g. Bad song + Bad album). It should also ideally be open source and have a Python API.

Yes, that sounds like a list for Santa Claus. Any ideas?

Edit

I checked out a few solutions, but no silver bullet so far.

  • NLTK parses text and extracts "named entities" (AFAIU, a part of a sentence that refers to a name), but it does not return Wikidata topics, just plain text. This means that it will likely not understand that "I shot the sheriff" is the name of a song by Bob Marley; it will instead treat it as a sentence.
  • OpenNLP does roughly the same.
  • Wikidata has a search API, but it's just one term at a time, and it does not handle disambiguation.
  • There are a few commercial services (OpenCalais, AlchemyAPI, CogitoAPI...) but none really shines, IMHO.
amirouche
MiniQuark
  • @Matt1776 Perhaps the question sounded too vague, but it's a real life programming problem I'm facing, and I'm sure others have had (or will have) the same. Anyway, thanks a lot for your kind upvote. – MiniQuark Nov 08 '16 at 18:28
  • what do you think of my answer? – amirouche Nov 10 '16 at 18:47
  • Hi @amirouche, thanks for your answer. I like the idea (+1) but I can't accept this answer: I tried something similar using NLTK, and unfortunately it fails when a topic has a name that looks like a piece of sentence, for example "I shot the sheriff" by Bob Marley. This actually happens very often with song names or artist names (eg. Rage against the machine). I'm leaning towards a solution that will first look for all the topic names it can find (even approximately), then rank them using various signals, including perhaps some NLP signals (is it a noun? Is it a person?). What do you think? – MiniQuark Nov 15 '16 at 15:19
  • “various signals” is too vague for me to take a position. Also you should use some kind of syntactic parsing or dependency parse tree to find out which part of the sentence can be interesting to search in wikidata. – amirouche Nov 15 '16 at 20:21
  • Google recently released the https://cloud.google.com/natural-language, I think you should try it. It will point you to a Wikipedia article about the object, but from there you can get its wikidata page. – marfi Dec 09 '16 at 15:49
  • Apparently this is called wikification; have a look at this thread on the wikidata mailing list: https://lists.wikimedia.org/pipermail/wikidata/2017-February/010252.html – amirouche Feb 07 '17 at 16:50

1 Answer


You can use spaCy to retrieve named entities, then link them to Wikidata using the search API.
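The linking step can be sketched as follows. This is only a minimal sketch: it assumes the entity strings have already been extracted (the spaCy step is left out), it uses the `wbsearchentities` action of the public Wikidata API, and the function names and the User-Agent string are made up for illustration. The search only matches labels and aliases, so ranking and disambiguation of the hits are still up to you.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_search_url(term, language="en", limit=5):
    """Build a wbsearchentities URL for one search term (hypothetical helper)."""
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": language,
        "limit": limit,
        "format": "json",
    }
    return WIKIDATA_API + "?" + urllib.parse.urlencode(params)


def search_wikidata(term, language="en", limit=5):
    """Return (id, label, description) tuples for the candidate topics."""
    request = urllib.request.Request(
        build_search_url(term, language, limit),
        # Illustrative User-Agent; replace it with one describing your tool.
        headers={"User-Agent": "entity-linking-sketch/0.1"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        payload = json.load(response)
    return [
        (hit["id"], hit.get("label"), hit.get("description"))
        for hit in payload.get("search", [])
    ]
```

Calling `search_wikidata("Bob Marley")` needs network access and returns several candidates, so when the system is unsure it can simply keep more than one of them.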

For the part of the sentence that spaCy does not match as a named entity, you can build the list of n-grams of the sentence and, starting with the longest n-gram, use the Wikidata search API to look up Wikidata topics.
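Here is a rough sketch of that longest-first n-gram lookup. It assumes the text is already tokenized, and `lookup` is a hypothetical callable that maps a phrase to a list of candidate topics (for example a wrapper around the Wikidata search API). Skipping tokens that are already covered by a longer match is my own simplification, not something the search API does for you.

```python
def link_ngrams(tokens, lookup):
    """Match n-grams against `lookup`, trying the longest n-grams first.

    `lookup` is a hypothetical callable: phrase -> list of candidates.
    Tokens covered by an earlier (longer) match are not searched again.
    """
    covered = [False] * len(tokens)
    matches = []
    for size in range(len(tokens), 0, -1):
        for start in range(len(tokens) - size + 1):
            span = range(start, start + size)
            if any(covered[i] for i in span):
                continue
            phrase = " ".join(tokens[start:start + size])
            candidates = lookup(phrase)
            if candidates:
                matches.append((phrase, candidates))
                for i in span:
                    covered[i] = True
    return matches


# Tiny in-memory stand-in for the search API, using the IDs from the question.
fake_index = {"Michael Jackson": ["Q2831"], "Bad": ["Q275422"]}
tokens = "Bad is a song by Michael Jackson".split()
print(link_ngrams(tokens, lambda phrase: fake_index.get(phrase, [])))
```

With a real search API every n-gram costs one request, and common words such as "is" or "a" will also return hits, so in practice you would want to filter or rank the results.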

POS tagging can be put to good use. That said, syntactic parse information is more powerful, since it tells you the relations between the words. For instance, given the following output from link-grammar:

Found 8 linkages (8 had no P.P. violations)
    Linkage 1, cost vector = (UNUSED=0 DIS= 0.15 LEN=9)

    +-------------------------Xp-------------------------+
    +----------->WV---------->+                          |
    +-------Wd------+         +---------Osn--------+     |
    |       +---G---+----Ss---+----Os----+         |     |
    |       |       |         |          |         |     |
LEFT-WALL Bob.m Marley[!] wrote.v-d Natural[!] Mystic[!] . 

You can tell that the subject is “Bob Marley” because:

  1. “wrote” is connected to “Marley” with an S link (Ss in the diagram), which connects subject nouns to finite verbs.
  2. “Marley” is connected to “Bob” with a G link, which connects proper nouns together.

So “Bob Marley” is a good candidate for an entity (both words are also capitalized).

Given the above parse "tree", it is difficult to tell whether “Natural” and “Mystic” are related, even though they are on the same side of the sentence.

The second parse provided by link-grammar has the same cost vector and links “Natural” and “Mystic” together, again with a G link.

Here it is:

    Linkage 2, cost vector = (UNUSED=0 DIS= 0.15 LEN=9)

    +-------------------------Xp-------------------------+
    +----------->WV---------->+                          |
    +-------Wd------+         +---------Os---------+     |
    |       +---G---+----Ss---+          +----G----+     |
    |       |       |         |          |         |     |
LEFT-WALL Bob.m Marley[!] wrote.v-d Natural[!] Mystic[!] .

So in my opinion “Bob Marley” and “Natural Mystic” are good candidates for a Wikidata search.
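To make this concrete, here is a small sketch that collects the words joined by G links into candidate phrases. The `(left_index, right_index, label)` tuples are a made-up representation that I transcribed by hand from Linkage 2 above; this is not the API of the link-grammar Python bindings.

```python
def proper_noun_phrases(words, links):
    """Group words connected by G links into candidate entity names.

    `links` is a list of (left_index, right_index, label) tuples, a
    hypothetical format used only for this sketch.
    """
    groups = []
    for left, right, label in sorted(links):
        if label != "G":
            continue
        for group in groups:
            # Extend an existing chain of proper nouns.
            if left in group or right in group:
                group.update((left, right))
                break
        else:
            groups.append({left, right})
    return [" ".join(words[i] for i in sorted(group)) for group in groups]


# Linkage 2 from above, transcribed by hand.
words = ["LEFT-WALL", "Bob", "Marley", "wrote", "Natural", "Mystic", "."]
links = [
    (0, 6, "Xp"), (0, 3, "WV"), (0, 2, "Wd"),
    (1, 2, "G"), (2, 3, "Ss"), (3, 5, "Os"), (4, 5, "G"),
]
print(proper_noun_phrases(words, links))
```

Each resulting phrase can then be sent to the Wikidata search.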

That was the easy problem where grammar and spelling are correct.

Here is one parse out of 11 for the same sentence written in lowercase:

Linkage 1, cost vector = (UNUSED=1 DIS= 0.15 LEN=14)

    +------------------------Xp------------------------+
    +----------------------Wa---------------------+    |
    |       +------------------AN-----------------+    |
    |       |        +-------------AN-------------+    |
    |       |        |                  +----AN---+    |
    |       |        |                  |         |    |
LEFT-WALL Bob.m marley[?].n [wrote] natural.n mystic.n . 

LG doesn't even recognize the verb.

amirouche