I am looking for a solution to extract the list of concepts that a text (or HTML) document is about. I'd like the concepts to be Wikidata topics (or Freebase or DBpedia).
For example, "Bad is a song by Mikael Jackson" should return Michael Jackson (the artist, Wikidata Q2831) and Bad (the song, Wikidata Q275422). As this example shows, the system should be robust to spelling mistakes (Mikael) and ambiguity (Bad).
Ideally, the system should work across multiple languages and on both short and long texts, and when it is unsure it should return multiple topics (e.g. Bad song + Bad album). It should also ideally be open source and have a Python API.
Yes, that sounds like a list for Santa Claus. Any ideas?
Edit
I checked out a few solutions, but no silver bullet so far.
- NLTK parses text and extracts "named entities" (AFAIU, parts of a sentence that refer to a name), but it does not return Wikidata topics, just plain text. This means that it will likely not understand that "I shot the sheriff" is the name of a song by Bob Marley; it will instead treat it as an ordinary sentence.
- OpenNLP does roughly the same.
- Wikidata has a search API, but it only takes one term at a time, and it does not handle disambiguation.
- There are a few commercial services (OpenCalais, AlchemyAPI, CogitoAPI...) but none really shines, IMHO.
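To make the Wikidata limitation concrete, here is a minimal sketch of the one-term-at-a-time lookup, assuming the `wbsearchentities` action of the Wikidata API. The function names (`build_search_url`, `parse_candidates`, `search_candidates`) and the User-Agent string are made up for illustration. It only turns a single mention into a list of candidate items; finding the mentions in the text and picking the right candidate are exactly the parts that are still missing.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_search_url(mention, language="en", limit=5):
    """Build a wbsearchentities query URL for a single mention."""
    params = {
        "action": "wbsearchentities",
        "search": mention,
        "language": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return WIKIDATA_API + "?" + urlencode(params)


def parse_candidates(payload):
    """Turn a decoded JSON response into (id, label, description) tuples."""
    return [
        (hit["id"], hit.get("label", ""), hit.get("description", ""))
        for hit in payload.get("search", [])
    ]


def search_candidates(mention, language="en", limit=5):
    """Query Wikidata for one mention (needs network access)."""
    request = Request(
        build_search_url(mention, language, limit),
        # Hypothetical client name; Wikimedia asks clients to identify themselves.
        headers={"User-Agent": "entity-linking-sketch/0.1"},
    )
    with urlopen(request, timeout=10) as response:
        return parse_candidates(json.load(response))


if __name__ == "__main__":
    # The mentions are supplied by hand here; nothing extracts them from text.
    for mention in ["Bad", "Michael Jackson"]:
        print(mention, search_candidates(mention, limit=3))
```

Each mention comes back with several candidates and no indication of which one the sentence actually means, so the song/album ambiguity of "Bad" is left unresolved.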