I have the following paragraph:

Public buses operating on all internal lines in Karak governorate have been on strike yesterday to protest against the decision to remove working buses that are over 12 years old. Bus drivers and owners said the new government's decision to remove working buses, which are over 12 years of age, would mean large financial losses to owners of these buses, most of whom suffer from high debt because of their purchase. "The government is not aware of what it is doing, especially in the case of the cancellation of thousands of buses operating in various parts of the Kingdom, which bought hard-earned through the banks and at great financial costs." He pointed out that "buses will remain idle until the government review the decision as unfair to thousands of families in the Kingdom." For his part, the head of the office of the Karak Transport Regulatory Authority, Mahmoud Al-Sarayra, did not answer Al Ghad's calls for a response to the complaints of drivers and bus owners

Running the following code on the paragraph:

import nltk
# Assumes `paragraph` holds the text above as a single string.
sentences = [x.replace('.', '').replace('"', '') for x in nltk.sent_tokenize(paragraph)]
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = list(nltk.ne_chunk_sents(tagged_sentences))
# Keep only the named-entity subtrees, de-duplicated and sorted.
entities = []
for tree in (x for s in chunked_sentences for x in s if isinstance(x, nltk.Tree)):
    if tree not in entities:
        entities.append(tree)
entities.sort()

The NLTK function ne_chunk_sents returns the following named entities:

[Tree('GPE', [('Bus', 'NNP')]),
 Tree('GPE', [('Karak', 'NNP')]),
 Tree('GPE', [('Public', 'NNP')]),
 Tree('ORGANIZATION', [('Karak', 'NNP'), ('Transport', 'NNP'), ('Regulatory', 'NNP'), ('Authority', 'NNP')]),
 Tree('ORGANIZATION', [('Kingdom', 'NNP')]),
 Tree('PERSON', [('Al', 'NNP'), ('Ghad', 'NNP')]),
 Tree('PERSON', [('Mahmoud', 'NNP'), ('Al-Sarayra', 'NNP')])]

GPE stands for "Geopolitical Entity". I'm not sure that "Public" and "Bus" qualify; Karak is the one I'm looking for. What's the easiest way in NLTK to distinguish common English words such as Public and Bus from words that are not English and are most likely place names?

NOTE: This is similar to this question from 2 years ago that didn't get a definitive answer.

Lars Ericson

1 Answer

So following the lead of the similar question from 2 years ago, here is a solution:

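# Flatten each entity tree into a (label, text) pair.
e2 = [(x.label(), ' '.join(word for word, tag in x.leaves())) for x in entities]
# Keep only the geopolitical entities.
e3 = [text for label, text in e2 if label == 'GPE']
# Requires the words corpus: nltk.download('words')
english_vocab = set(w.lower() for w in nltk.corpus.words.words())
# Drop candidates that are ordinary English words.
e4 = [x for x in e3 if x.lower() not in english_vocab]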

Then e4 is the list

['Karak']
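
The filtering step can also be sketched on its own, without NLTK, to make the logic easier to test. This is only an illustration: the function name non_english_entities and the tiny toy_vocab set are hypothetical stand-ins (in practice the vocabulary would come from something like nltk.corpus.words.words()), and the rule for multi-word entities (keep the entity if any token is outside the vocabulary) is an assumption rather than part of the solution above.

```python
def non_english_entities(candidates, vocab):
    """Return the candidates that contain at least one token missing from vocab.

    candidates: iterable of entity strings, e.g. the GPE texts in e3.
    vocab: set of lowercase English words (a hypothetical stand-in here).
    """
    kept = []
    for text in candidates:
        tokens = text.lower().split()
        # Keep the entity if any of its tokens is not a known English word.
        if any(token not in vocab for token in tokens):
            kept.append(text)
    return kept


# Tiny stand-in vocabulary, for illustration only.
toy_vocab = {"bus", "public", "kingdom", "transport"}
gpe_candidates = ["Bus", "Karak", "Public"]
result = non_english_entities(gpe_candidates, toy_vocab)
print(result)  # ['Karak']
```

One limitation of this approach: a place name that is also an ordinary English word would be filtered out along with the false positives.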
Lars Ericson