I want to extract information from a given news article. The articles are plain text like this:
O'Sullivan could run in Worlds
Sonia O'Sullivan has indicated that she would like to participate in next month's World Cross Country Championships in St Etienne.
Athletics Ireland have hinted that the 35-year-old Cobh runner may be included in the official line-up for the event in France on 19-20 March. Provincial teams were selected after last Saturday's Nationals in Santry and will be officially announced this week. O'Sullivan is at present preparing for the London marathon on 17 April. The participation of O'Sullivan, currently training at her base in Australia, would boost the Ireland team who won the bronze three years ago. The first three at Santry last Saturday, Jolene Byrne, Maria McCambridge and Fionnualla Britton, are automatic selections and will most likely form part of the long-course team. O'Sullivan will also take part in the Bupa Great Ireland Run on 9 April in Dublin.
I tried this code, which extracts relations from the ieer document 'NYT_19980315':
import re
import nltk

# Keep pairs whose connecting text ends in "in", unless a word ending in "ing" follows
IN = re.compile(r'.*\bin\b(?!\b.+ing)')
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.relextract.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
        print(nltk.sem.relextract.rtuple(rel))
With this code, the output is:
[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
[ORG: 'McGlashan & Sarrail'] 'firm in' [LOC: 'San Mateo']
[ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']
[ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']
[ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']
[ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']
[ORG: 'WGBH'] 'in' [LOC: 'Boston']
[ORG: 'Bastille Opera'] 'in' [LOC: 'Paris']
[ORG: 'Omnicom'] 'in' [LOC: 'New York']
[ORG: 'DDB Needham'] 'in' [LOC: 'New York']
[ORG: 'Kaplan Thaler Group'] 'in' [LOC: 'New York']
[ORG: 'BBDO South'] 'in' [LOC: 'Atlanta']
[ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']
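A quick check of the IN pattern on its own, assuming extract_rels() applies it with match() to the text between the two entities (which is what the output above suggests). The list of sample strings is made up for illustration; only the pattern comes from my code.

```python
import re

# Same pattern as above: the text must contain the word "in",
# and "in" must not be followed by something ending in "ing".
IN = re.compile(r'.*\bin\b(?!\b.+ing)')

# Hypothetical connecting strings, some taken from the output above.
fillers = [
    'in',
    ', based in',
    'firm in',
    'of',
    'in supervising the transition of',
]

# Record which strings the pattern accepts.
matches = {filler: bool(IN.match(filler)) for filler in fillers}
for filler, accepted in matches.items():
    print(repr(filler), accepted)
```

Only the first three strings are accepted: 'of' has no "in", and the last one is rejected by the negative lookahead because of "supervising".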
The problem is that I want to run this on my own text (the news article that the user supplies), in this case 002.txt. I am not able to pass the text to the nltk.sem.extract_rels() function; it fails with AttributeError: 'Tree' object has no attribute 'text'. Does anyone know how to do this? Is there any way to convert the .txt file into an ieer document?
Now, I am working with this code:
import re
import nltk
import os

with open('002.txt', 'r') as f:
    sample = f.read()

sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

def extract_entity_names(t):
    entity_names = []
    if hasattr(t, 'label') and t.label:
        if t.label() == 'NE':
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))
    return entity_names

entity_names = []
for tree in chunked_sentences:
    entity_names.extend(extract_entity_names(tree))

# Print unique entity names
print(set(entity_names))
The output is:
{'London', 'Australia', 'Worlds Sonia', 'Santry', 'Bupa Great Ireland Run', 'Maria', 'France', 'Fionnualla Britton', 'Dublin', 'Jolene Byrne', 'Ireland'}
Then I tried to add these lines, which call nltk.sem.extract_rels():
for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
    print(nltk.sem.rtuple(rel))
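One thing that might work, as a sketch rather than a confirmed fix: with corpus='ieer', extract_rels() appears to expect a document object with text and headline attributes, which would explain the AttributeError when it is given a plain Tree. With corpus='ace' it takes a chunk tree directly. The tree below is built by hand as a stand-in for the output of nltk.ne_chunk(tagged_sentence) without binary=True, so the sentence and the entity labels (ORGANIZATION, GPE) are assumptions, not something my text produced.

```python
import re

from nltk.sem import extract_rels, rtuple
from nltk.tree import Tree

IN = re.compile(r'.*\bin\b(?!\b.+ing)')

# Hand-built stand-in for one chunked sentence (hypothetical data).
# In practice this would come from nltk.ne_chunk(tagged_sentence),
# called without binary=True so the entities keep their types.
tree = Tree('S', [
    Tree('ORGANIZATION', [('WHYY', 'NNP')]),
    ('in', 'IN'),
    Tree('GPE', [('Philadelphia', 'NNP')]),
    ('and', 'CC'),
    Tree('GPE', [('Boston', 'NNP')]),
])

# corpus='ace' reads the tree itself instead of doc.text / doc.headline.
# 'ORG' is accepted as a short form of 'ORGANIZATION'.
rels = extract_rels('ORG', 'GPE', tree, corpus='ace', pattern=IN)
for rel in rels:
    print(rtuple(rel))
```

For the real text, the same call would go inside a loop over the chunked sentences. As far as I can tell, the leaves here are (word, tag) pairs, so the connecting text that the pattern sees includes the POS tags (for example in/IN), and a pattern written for the ieer corpus may need adjusting.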