I am using DBpedia Spotlight to extract DBpedia resources as follows.

import json
from SPARQLWrapper import SPARQLWrapper, JSON
import requests
import urllib.parse

## initial consts
BASE_URL = 'http://api.dbpedia-spotlight.org/en/annotate?text={text}&confidence={confidence}&support={support}'
TEXT = "Tolerance, safety and efficacy of Hedera helix extract in inflammatory bronchial diseases under clinical practice conditions: a prospective, open, multicentre postmarketing study in 9657 patients.     In this postmarketing study 9657 patients (5181 children) with bronchitis (acute or chronic bronchial inflammatory disease) were treated with a syrup containing dried ivy leaf extract. After 7 days of therapy, 95% of the patients showed improvement or healing of their symptoms. The safety of the therapy was very good with an overall incidence of adverse events of 2.1% (mainly gastrointestinal disorders with 1.5%). In those patients who got concomitant medication as well, it could be shown that the additional application of antibiotics had no benefit respective to efficacy but did increase the relative risk for the occurrence of side effects by 26%. In conclusion, it is to say that the dried ivy leaf extract is effective and well tolerated in patients with bronchitis. In view of the large population considered, future analyses should approach specific issues concerning therapy by age group, concomitant therapy and baseline conditions."
CONFIDENCE = '0.5'
SUPPORT = '10'
REQUEST = BASE_URL.format(
    text=urllib.parse.quote_plus(TEXT), 
    confidence=CONFIDENCE, 
    support=SUPPORT
)
HEADERS = {'Accept': 'application/json'}
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
all_urls = []

r = requests.get(url=REQUEST, headers=HEADERS)
response = r.json()
resources = response['Resources']
for res in resources:
    all_urls.append(res['@URI'])
print(all_urls)

My text is shown below:

Tolerance, safety and efficacy of Hedera helix extract in inflammatory bronchial diseases under clinical practice conditions: a prospective, open, multicentre postmarketing study in 9657 patients. In this postmarketing study 9657 patients (5181 children) with bronchitis (acute or chronic bronchial inflammatory disease) were treated with a syrup containing dried ivy leaf extract. After 7 days of therapy, 95% of the patients showed improvement or healing of their symptoms. The safety of the therapy was very good with an overall incidence of adverse events of 2.1% (mainly gastrointestinal disorders with 1.5%). In those patients who got concomitant medication as well, it could be shown that the additional application of antibiotics had no benefit respective to efficacy but did increase the relative risk for the occurrence of side effects by 26%. In conclusion, it is to say that the dried ivy leaf extract is effective and well tolerated in patients with bronchitis. In view of the large population considered, future analyses should approach specific issues concerning therapy by age group, concomitant therapy and baseline conditions.

The results I got are as follows.

['http://dbpedia.org/resource/Hedera', 
'http://dbpedia.org/resource/Helix', 
'http://dbpedia.org/resource/Bronchitis', 
'http://dbpedia.org/resource/Cough_medicine',
'http://dbpedia.org/resource/Hedera', 
'http://dbpedia.org/resource/After_7',
'http://dbpedia.org/resource/Gastrointestinal_tract',
'http://dbpedia.org/resource/Antibiotics',
'http://dbpedia.org/resource/Relative_risk',
'http://dbpedia.org/resource/Hedera',
'http://dbpedia.org/resource/Bronchitis']

As you can see, the results are not very good.

For example, consider Hedera helix extract in the text above. Even though DBpedia has a resource for Hedera helix (http://dbpedia.org/resource/Hedera_helix), Spotlight splits the mention into two URIs: http://dbpedia.org/resource/Hedera and http://dbpedia.org/resource/Helix.

For my dataset, I would like each mention to be linked to the longest matching term in DBpedia. What improvements can I make to get this output?

I am happy to provide more details if needed.

Stanislav Kralin
EmJ
  • Post-process the results, train it on your own dataset, or use another tool or even multiple tools. Solving this problem in general is non-trivial. – UninformedUser Jul 30 '19 at 11:43
  • @AKSW Thank you for your comment. Do you have any suggestions for other tools that I can try out, or any post-processing techniques that I can use in this regard? I look forward to hearing from you. Thank you very much :) – EmJ Jul 30 '19 at 12:18
  • No, that's NLP and not my topic. Noun phrase detection followed by linking to DBpedia is what your corner case needs here. As usual, corner cases can be tricky. NLP starts from basic steps like sentence detection, then POS tagging, then NP detection, and so on, so any earlier error will influence later steps. – UninformedUser Jul 30 '19 at 12:48
  • @AKSW thanks a lot. sure, I will have a look into the areas that you have mentioned :) – EmJ Jul 30 '19 at 13:16
  • `pyspotlight` might be of interest. Although it probably won't improve recognition, at least you'll write less code. It also returns more results than your code above. – Superdooperhero Jan 06 '20 at 09:23
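
The post-processing idea raised in the comments could be sketched roughly as follows. This is only an illustration, not part of Spotlight: it assumes each entry in `Resources` carries `@surfaceForm` and `@offset` next to `@URI`, and it treats two mentions as adjacent when exactly one character separates them. The helper names (`merge_adjacent`, `candidate_uri`, `longest_matches`) and the `exists` callback are hypothetical, and the sample annotations are hand-written rather than real API output.

```python
DBPEDIA_PREFIX = "http://dbpedia.org/resource/"


def merge_adjacent(resources):
    """Group annotations whose surface forms are separated by one character."""
    spans = sorted(
        (int(r["@offset"]), r["@surfaceForm"], r["@URI"]) for r in resources
    )
    groups = []
    for offset, surface, uri in spans:
        end = offset + len(surface)
        if groups and offset - groups[-1]["end"] == 1:
            # Directly follows the previous mention: extend that group.
            groups[-1]["surfaces"].append(surface)
            groups[-1]["uris"].append(uri)
            groups[-1]["end"] = end
        else:
            groups.append({"surfaces": [surface], "uris": [uri], "end": end})
    return groups


def candidate_uri(group):
    """Build a guessed DBpedia URI from the merged surface forms (assumption)."""
    label = "_".join(s.replace(" ", "_") for s in group["surfaces"])
    return DBPEDIA_PREFIX + label


def longest_matches(resources, exists):
    """Use the merged URI when exists(uri) confirms it, else keep the originals."""
    result = []
    for group in merge_adjacent(resources):
        merged = candidate_uri(group)
        if len(group["uris"]) > 1 and exists(merged):
            result.append(merged)
        else:
            result.extend(group["uris"])
    return result


# Hand-written sample annotations; the last offset is illustrative only.
sample = [
    {"@URI": DBPEDIA_PREFIX + "Hedera", "@surfaceForm": "Hedera", "@offset": "34"},
    {"@URI": DBPEDIA_PREFIX + "Helix", "@surfaceForm": "helix", "@offset": "41"},
    {"@URI": DBPEDIA_PREFIX + "Bronchitis", "@surfaceForm": "bronchitis", "@offset": "250"},
]
known = {DBPEDIA_PREFIX + "Hedera_helix"}  # stand-in for a real DBpedia lookup
print(longest_matches(sample, exists=lambda uri: uri in known))
```

In practice `exists` would have to query DBpedia, for instance with a SPARQL ASK query sent through SPARQLWrapper. The guessed URI will not always match DBpedia's naming or capitalisation, so this only recovers some of the split mentions.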

1 Answer

Although this answer comes quite late, you can use the Babelfy API (through the `babelpy` package in Python) to obtain DBpedia URIs that cover longer spans of text. I reproduced the problem using the code below:

from babelpy.babelfy import BabelfyClient

# Adjacent string literals are concatenated, so this is one single-line string.
text = (
    "Tolerance, safety and efficacy of Hedera helix extract in inflammatory "
    "bronchial diseases under clinical practice conditions: a prospective, open, "
    "multicentre postmarketing study in 9657 patients.     In this postmarketing "
    "study 9657 patients (5181 children) with bronchitis (acute or chronic "
    "bronchial inflammatory disease) were treated with a syrup containing dried ivy "
    "leaf extract. After 7 days of therapy, 95% of the patients showed improvement "
    "or healing of their symptoms. The safety of the therapy was very good with an "
    "overall incidence of adverse events of 2.1% (mainly gastrointestinal disorders "
    "with 1.5%). In those patients who got concomitant medication as well, it could "
    "be shown that the additional application of antibiotics had no benefit "
    "respective to efficacy but did increase the relative risk for the occurrence "
    "of side effects by 26%. In conclusion, it is to say that the dried ivy leaf "
    "extract is effective and well tolerated in patients with bronchitis. In view "
    "of the large population considered, future analyses should approach specific "
    "issues concerning therapy by age group, concomitant therapy and baseline "
    "conditions."
)

# Instantiate BabelFy client.
params = dict()
params['lang'] = 'english'
babel_client = BabelfyClient("**Your Registration Code For API**", params)

# Babelfy sentence.
babel_client.babelfy(text)

# Get all merged entities.
print(babel_client.all_merged_entities)

The output contains one dictionary per merged entity in the text, in the format shown below. You can store and process these dictionaries to extract the DBpedia URIs from the `DBpediaURL` field.

{'start': 34,
'end': 45,
'text': 'Hedera helix',
'isEntity': True,
'tokenFragment': {'start': 6, 'end': 7},
'charFragment': {'start': 34, 'end': 45},
'babelSynsetID': 'bn:00021109n',
'DBpediaURL': 'http://dbpedia.org/resource/Hedera_helix',
'BabelNetURL': 'http://babelnet.org/rdf/s00021109n',
'score': 1.0,
'coherenceScore': 0.0847457627118644,
'globalScore': 0.0013494092960806407,
'source': 'BABELFY'},
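
To turn that structure into a plain list of URIs, something along these lines might work. It is a minimal sketch, assuming the result is a list of dictionaries shaped like the sample above and that fragments not linked to DBpedia either lack the `DBpediaURL` key or hold an empty value. The function name `dbpedia_uris` is hypothetical, and the sample list is hand-written from the output shown above rather than taken from a live API call.

```python
def dbpedia_uris(entities):
    """Return distinct, non-empty DBpedia URLs in order of first appearance."""
    seen = set()
    uris = []
    for entity in entities:
        url = entity.get("DBpediaURL")  # missing key or empty value is skipped
        if url and url not in seen:
            seen.add(url)
            uris.append(url)
    return uris


# Hand-written sample in the same shape as the output above (fields trimmed).
merged = [
    {"start": 34, "end": 45, "text": "Hedera helix", "isEntity": True,
     "DBpediaURL": "http://dbpedia.org/resource/Hedera_helix"},
    {"start": 47, "end": 53, "text": "extract", "isEntity": True,
     "DBpediaURL": ""},
    {"start": 34, "end": 45, "text": "Hedera helix", "isEntity": True,
     "DBpediaURL": "http://dbpedia.org/resource/Hedera_helix"},
]
print(dbpedia_uris(merged))
```

With the real client, the call would presumably be `dbpedia_uris(babel_client.all_merged_entities)`.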