I've written an n-gram extractor that pulls organization names from text. However, it only returns the first letter of the first word followed by the last word. For example, if the phrase "Sprint International Corporation" appears in the text, the program returns "s corporation" as the n-gram. Do you know what I'm doing wrong? I've posted the code and output below. Thanks.
This is the code for the n-gram extractor.
def org_ngram(classified_text):
    orgs = [c for c in classified_text if (c[1]=="ORGANIZATION")]
    #print(orgs)
    combined_orgs = []
    prev_org = False
    new_org = ("", "ORGANIZATION")
    for i in range(len(classified_text)):
        if classified_text[i][1] != "ORGANIZATION":
            prev_org = False
        else:
            if prev_org:
                new_org = new_org[0] + " " + classified_text[i][0].lower()
            else:
                combined_orgs.append(new_org)
                new_org = classified_text[i][0].lower()
            prev_org = True
    combined_orgs.append(new_org)
    combined_orgs = combined_orgs[1:]
    return combined_orgs
Here is the text I analyze and the code I use to tag it.
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize
st = StanfordNERTagger('C:\\path\\english.all.3class.distsim.crf.ser.gz',
                       'C:\\Users\\path\\stanford-ner.jar',
                       encoding='utf-8')
text = "Trump met with representatives from Sprint International Corporation, Nike Inc, and Wal-Mart Company regarding the trade war."
tokenized_text = word_tokenize(text)
classified_text = st.tag(tokenized_text)
orgs = org_ngram(classified_text)
print(orgs)
Here is the current output.
['s corporation', 'n inc', 'w company']
This is what I want the output to look like.
['sprint international corporation', 'nike inc', 'wal-mart company']
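One likely cause, judging from the code above: new_org starts out as a tuple, but after the first organization token it is reassigned to a plain string, so new_org[0] then picks out the first character of that string rather than the whole phrase. Below is a minimal sketch of one possible fix that collects the words of each phrase in a list and joins them once the phrase ends. The function name combine_orgs and the hand-written tagged sample are illustrative assumptions; the real output of st.tag(...) may tag these tokens differently.

```python
def combine_orgs(classified_text):
    """Join consecutive ORGANIZATION tokens into lowercase phrases."""
    combined = []
    current = []  # words of the organization phrase being built
    for word, tag in classified_text:
        if tag == "ORGANIZATION":
            current.append(word.lower())
        elif current:
            # A non-organization token ends the current phrase.
            combined.append(" ".join(current))
            current = []
    if current:
        # Flush a phrase that runs to the end of the text.
        combined.append(" ".join(current))
    return combined


# Hand-written stand-in for st.tag(...) output (assumed tags, not real tagger output).
sample = [
    ("Trump", "PERSON"), ("met", "O"), ("with", "O"),
    ("representatives", "O"), ("from", "O"),
    ("Sprint", "ORGANIZATION"), ("International", "ORGANIZATION"),
    ("Corporation", "ORGANIZATION"), (",", "O"),
    ("Nike", "ORGANIZATION"), ("Inc", "ORGANIZATION"), (",", "O"),
    ("and", "O"),
    ("Wal-Mart", "ORGANIZATION"), ("Company", "ORGANIZATION"),
    ("regarding", "O"), ("the", "O"), ("trade", "O"), ("war", "O"), (".", "O"),
]
print(combine_orgs(sample))
```

Building each phrase as a list of words avoids indexing into a string, and it also removes the need for the placeholder tuple and the combined_orgs[1:] slice.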