
I am trying to extract location names, country names, city names, and tourist places from a text file using NLTK or the spaCy library in Python.

I have tried the following:

import spacy
en = spacy.load('en')

sents = en(open('subtitle.txt').read())
place = [ee for ee in sents.ents]

Output:

[1, 
, three, London, 
, 
, 
, 
, first, 
, 
, 00:00:20,520, 
, 
, London, the

4
00:00:20,520, 00:00:26,130
, Buckingham Palace, 
, 

I only want location names, country names, city names, and any places within a city.

I also tried using NLTK:

import nltk
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

with open('subtitle.txt', 'r') as f:
    sample = f.read()


sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

def extract_entity_names(t):
    entity_names = []

    if hasattr(t, 'label') and t.label:
        if t.label() == 'NE':
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))

    return entity_names

entity_names = []
for tree in chunked_sentences:
    # Print results per sentence
    #print (extract_entity_names(tree))

    entity_names.extend(extract_entity_names(tree))

# Print all entity names
#print (entity_names)

# Print unique entity names
print (set(entity_names))

Output:

{'Okay', 'Buckingham Palace', 'Darwin Brasserie', 'PDF', 'London', 'Local Guide', 'Big Ben'}

Here I also get unwanted words such as 'Okay', 'PDF', and 'Local Guide', and some places are missing.

Please suggest.

Edit-1

Script

import spacy
nlp = spacy.load('en_core_web_lg')

gpe = [] # countries, cities, states
loc = [] # non gpe locations, mountain ranges, bodies of water


doc = nlp(open('subtitle.txt').read())
for ent in doc.ents:
    if (ent.label_ == 'GPE'):
        gpe.append(ent.text)
    elif (ent.label_ == 'LOC'):
        loc.append(ent.text)

cities = []
countries = []
other_places = []
import wikipedia
for text in gpe:
    summary = str(wikipedia.summary(text))
    if ('city' in summary):
        cities.append(text)
        print (cities)
    elif ('country' in summary):
        countries.append(text)
        print (countries)
    else:
        other_places.append(text)
        print (other_places)

for text in loc:
    other_places.append(text)
    print (other_places)

Using the script from the answer, I get the output below:

['London', 'London']
['London', 'London', 'London']
['London', 'London', 'London', 'London']
['London', 'London', 'London', 'London', 'London']
['London', 'London', 'London', 'London', 'London', 'London']
['London', 'London', 'London', 'London', 'London', 'London', 'London']
['London', 'London', 'London', 'London', 'London', 'London', 'London', 'London']
['London', 'London', 'London', 'London', 'London', 'London', 'London', 'London', 'London']
['London', 'London', 'London', 'London', 'London', 'London', 'London', 'London', 'London', 'London']
['London', 'London', 'London', 'London', 'London', 'London', 'London', 'London', 'London', 'London', 'London']
['London', 'London', 'London', 'London', 'London', 'London', 'London', 'London', 'London', 'London', 'London', 'London']
user10468005
See this thread: https://stackoverflow.com/questions/59444065/differentiate-between-countries-and-cities-in-spacy-ner/68345017#68345017 – Camorales197 Jul 12 '21 at 10:43

1 Answer


You are looking for named entities. spaCy is an efficient library for finding named entities in text, but you should use it according to the docs.

You are looking for locations, countries, and cities. Those places fall under the GPE and LOC labels of the spaCy NER tagger. Specifically, GPE covers countries, cities, and states, and LOC covers non-GPE locations such as mountain ranges and bodies of water.
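As a small illustration of filtering on those two labels, here is a sketch that works on plain (text, label) pairs, such as you would get from `[(ent.text, ent.label_) for ent in doc.ents]`. The helper name `collect_places` and the sample pairs are made up for the example. It also drops repeated names, which may help if the same place is mentioned many times in the file.

```python
def collect_places(entities):
    """Split (text, label) pairs into unique GPE and LOC names, keeping first-seen order."""
    gpe, loc = [], []
    for text, label in entities:
        name = text.strip()
        if not name:
            continue  # skip entities that are only whitespace or newlines
        if label == "GPE" and name not in gpe:
            gpe.append(name)
        elif label == "LOC" and name not in loc:
            loc.append(name)
    return gpe, loc


# Hypothetical pairs standing in for [(ent.text, ent.label_) for ent in doc.ents]
sample = [("London", "GPE"), ("three", "CARDINAL"), ("London", "GPE"),
          ("Thames", "LOC"), ("England", "GPE")]
gpe, loc = collect_places(sample)
print(gpe)
print(loc)
```

Everything with a label other than GPE or LOC (numbers, times, ordinals) is simply ignored.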

If you just need those names in a list, you can run the NER tagger and keep only the entities with these labels. If you need to separate cities from countries, for example, you could then query Wikipedia for each name and check the summary to find out whether it is a city or a country. For this, you may find the wikipedia library for Python useful.

Example code:

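import spacy
import wikipedia

nlp = spacy.load('en_core_web_lg')

gpe = []  # countries, cities, states
loc = []  # non-GPE locations: mountain ranges, bodies of water

# Read the whole file and run the spaCy pipeline on it
with open('subtitle.txt') as f:
    doc = nlp(f.read())

# Keep only the entities tagged as places
for ent in doc.ents:
    if ent.label_ == 'GPE':
        gpe.append(ent.text)
    elif ent.label_ == 'LOC':
        loc.append(ent.text)

cities = []
countries = []
other_places = []

# Rough heuristic: look for 'city' or 'country' in the Wikipedia summary
for text in gpe:
    try:
        summary = wikipedia.summary(text)
    except (wikipedia.exceptions.DisambiguationError,
            wikipedia.exceptions.PageError):
        # Ambiguous or missing page: this name cannot be classified
        other_places.append(text)
        continue
    if 'city' in summary:
        cities.append(text)
    elif 'country' in summary:
        countries.append(text)
    else:
        other_places.append(text)

# LOC entities are neither cities nor countries
other_places.extend(loc)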
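Note that the substring check `'city' in summary` also matches words like "capacity", and a country's summary will often mention its largest city somewhere. A slightly stricter variant, offered only as a sketch: match whole words and look at the first sentence only. The function name `classify_summary` and the sample sentences are invented for illustration, and the sentence splitting is deliberately crude.

```python
import re


def classify_summary(summary):
    """Return 'city', 'country' or 'other' based on the first sentence only."""
    # Crude sentence split: cut at the first ., ! or ? followed by whitespace
    first_sentence = re.split(r"(?<=[.!?])\s+", summary.strip(), maxsplit=1)[0]
    # \b makes sure 'city' is a whole word, so 'capacity' does not match
    if re.search(r"\bcity\b", first_sentence, re.IGNORECASE):
        return "city"
    if re.search(r"\bcountry\b", first_sentence, re.IGNORECASE):
        return "country"
    return "other"


# Made-up summaries, not real Wikipedia text
print(classify_summary("Springfield is a city in the north. It lies in a small country."))
print(classify_summary("Freedonia is a country in Europe. Its largest city is Fredville."))
print(classify_summary("The stadium has a capacity of 5,000."))
```

You would call it with the string returned by `wikipedia.summary(text)` in place of the two `in` checks. It is still a heuristic, so expect some misclassifications.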

If you find the Wikipedia method insufficient or slow, you could also try training the NER tagger with your own entity labels. For this, have a look here.
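One more thing that may help with the input from the question: the output there includes cue numbers and timestamps such as 00:00:20,520, which suggests the file is in SRT subtitle format. Stripping that markup before running NER should give the tagger cleaner sentences. A minimal sketch, assuming the usual SRT layout (a cue number, a `start --> end` timing line, then the text); the helper name `strip_srt_markup` and the sample text are made up.

```python
import re

# SRT timing line, e.g. "00:00:20,520 --> 00:00:26,130"
TIMING_RE = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)


def strip_srt_markup(raw):
    """Drop cue numbers, timing lines and blank lines; join the spoken text."""
    kept = []
    for line in raw.splitlines():
        line = line.strip()
        # Note: a spoken line made up only of digits would be dropped too
        if not line or line.isdigit() or TIMING_RE.match(line):
            continue
        kept.append(line)
    return " ".join(kept)


# Hypothetical excerpt in SRT layout
sample = """1
00:00:15,000 --> 00:00:20,520
Welcome to London, the

2
00:00:20,520 --> 00:00:26,130
first stop is Buckingham Palace
"""
print(strip_srt_markup(sample))
```

You would then pass the cleaned string to `nlp(...)` instead of the raw file contents.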

gdaras