How to find place name in large file with regex python

Question

I'm trying to find exact matches for place names, which are listed in a country.txt file, inside a descriptions file (both described below).

Here is an example of country.txt:

Pic de Font Blanca
Roc Mélé
Pic des Langounelles
Pic de les Abelletes
Estany de les Abelletes
Port Vieux de la Coume d’Ose
Port de la Cabanette
Port Dret
Costa de Xurius
Font de la Xona

And here is the description.csv file:

The descriptions file contains the titles and descriptions of articles. What I am trying to do is find exact matches for the place names from country.txt in the descriptions file.

code.py

import csv
import time
import re

allCities = open('country.txt', encoding="utf8").readlines()
timestr = time.strftime("%Y-%m-%d-(%H-%M-%S)")

with open('description.csv') as descriptions,open('desc_place7---' + str(timestr) + '.csv', 'w', newline='', encoding='utf-8') as output:
    descriptions_reader = csv.DictReader(descriptions)
    fieldnames = ['title', 'description', 'place']
    output_writer = csv.DictWriter(output, delimiter='|', fieldnames=fieldnames)
    output_writer.writeheader()
    line=0
    pattern = r'|'.join(r'\b{}\b'.format(re.escape(city.strip())) for city in sorted(allCities, key=len, reverse=True))

    for eachRow in descriptions_reader:
        title = eachRow['row']
        description = eachRow['desc']
        citiesFound = set()
        found = re.findall(pattern, description, re.IGNORECASE | re.MULTILINE)
        citiesFound.update(found)
        if len(citiesFound)==0:
            output_writer.writerow({'title': title, 'description': description, 'place': " - "})

        else:
            output_writer.writerow({'title': title, 'description': description, 'place': " , ".join(citiesFound)})
        line += 1
        print(line)

expected output: output

But because country.txt (185.94 MB) is a large file, my code can't run to completion; it makes my laptop freeze. Is there a good way to handle this? I think the pattern line also hurts performance, but I need a regex to match exact words.

@DavidDr90 If there are potential matches like "New York" and "New York City" - the longer candidate must appear first in the pattern. — drowsyone, May 08 '20 at 05:22
@drowsyone so first find all "New York" candidates and then sort them. Don't sort a ~190 MB file — DavidDr90, May 08 '20 at 05:24
@DavidDr90 I see, but I also tried not sorting `allCities` and still got this problem — drowsyone, May 08 '20 at 05:26
@drowsyone and you shouldn't be reading a whole 190 MB file in memory (`allCities` list / `pattern` string). You should structure your code in such a way that cities are read line by line and not saved into memory. — Mushif Ali Nawaz, May 08 '20 at 05:27
@drowsyone can you explain what are you trying to get in the `pattern` variable? — DavidDr90, May 08 '20 at 05:30
@DavidDr90 that line is appending every city in `allCities` list into a string separated by `|`. — Mushif Ali Nawaz, May 08 '20 at 05:31
@MushifAliNawaz Would you help me with how to do that? I have no idea how to start. Thanks in advance — drowsyone, May 08 '20 at 05:34
@drowsyone you can use something like this: https://stackoverflow.com/questions/6475328/how-can-i-read-large-text-files-in-python-line-by-line-without-loading-it-into — Mushif Ali Nawaz, May 08 '20 at 05:36
@DavidDr90 I'm trying to find exact words in the description.csv file that match entries in country.txt. I want my code to return the title, the description, and also the places found, based on country.txt — drowsyone, May 08 '20 at 05:36
@drowsyone I'm missing a variable in your code. What is `m` in the line: `found = re.findall(pattern, m, re.IGNORECASE | re.MULTILINE)`? — DavidDr90, May 08 '20 at 05:37
@drowsyone I answered your question with some basic implementation. enjoy! — DavidDr90, May 08 '20 at 06:25
As it has not been asked before: Why do you need an image as output? — Jongware, May 08 '20 at 06:37
@MushifAliNawaz I already tried using head and tail, but it seems to make no difference; my code still runs very slowly — drowsyone, May 09 '20 at 03:22
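
The ordering point raised in the comments (a longer candidate such as "New York City" must come before "New York" in the alternation) can be checked with a small sketch. The helper name `build_pattern` and the sample names are hypothetical and only serve the illustration; the pattern construction mirrors the `pattern` line in the question.

```python
import re


def build_pattern(names):
    """Join escaped, word-bounded names into one alternation, longest first."""
    ordered = sorted(names, key=len, reverse=True)
    return '|'.join(r'\b{}\b'.format(re.escape(name.strip())) for name in ordered)


names = ['New York', 'New York City']  # hypothetical sample names
text = 'Flights to New York City'

# Without sorting, the first alternative that matches wins, so the shorter name is returned.
unsorted_pattern = '|'.join(r'\b{}\b'.format(re.escape(name)) for name in names)
print(re.findall(unsorted_pattern, text))     # ['New York']

# With the longest name first, the full place name is returned.
print(re.findall(build_pattern(names), text))  # ['New York City']
```

This only illustrates why the sort is there; it does not make a pattern built from a ~190 MB file any cheaper to compile or run.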

DavidDr90 · Accepted Answer · 2020-05-08T07:01:13.433

Here is a first implementation for your problem; you will need to adapt and improve it for your specific needs.

First, load all your descriptions into a pandas DataFrame like this:

import pandas as pd
descriptions = pd.read_csv('description.csv')

Then, do not read all of the file's lines into memory. You can read the country file line by line and look for matches in the descriptions data. Use the following:

import csv
import re
import time

timestr = time.strftime("%Y-%m-%d-(%H-%M-%S)")

with open('country.txt', encoding="utf8") as cities_file, open('desc_place7---' + str(timestr) + '.csv', 'w', newline='', encoding='utf-8') as output:
    fieldnames = ['title', 'description', 'place']
    output_writer = csv.DictWriter(output, delimiter='|', fieldnames=fieldnames)
    output_writer.writeheader()
    line = 0
    # Stream the place names one at a time instead of loading the whole file
    for city in cities_file:
        if not city.strip():
            continue  # skip blank lines; an empty name would match at every word boundary
        pattern = r'\b{}\b'.format(re.escape(city.strip()))
        # Check this one place name against every description
        for index, row in descriptions.iterrows():
            title = row['row']
            description = row['desc']
            citiesFound = set()
            found = re.findall(pattern, description, re.IGNORECASE | re.MULTILINE)
            citiesFound.update(found)
            if len(citiesFound) == 0:
                output_writer.writerow({'title': title, 'description': description, 'place': " - "})
            else:
                output_writer.writerow({'title': title, 'description': description, 'place': " , ".join(citiesFound)})
            line += 1
            print(line)

Hey, thank you. It works, but what if I want each row of `descriptions` to be written only once, instead of being repeated for every place name? — drowsyone, May 08 '20 at 07:16
@drowsyone you can see this answer [how to filter rows in pandas by regex](https://stackoverflow.com/questions/15325182/how-to-filter-rows-in-pandas-by-regex) or [searching matching string pattern from dataframe column in python pandas](https://stackoverflow.com/questions/36740680/searching-matching-string-pattern-from-dataframe-column-in-python-pandas) it will iterate for you — DavidDr90, May 08 '20 at 08:00
I don't want to iterate. I mean how to join all the place names found for each article row, not iterate over all rows for each place name — drowsyone, May 08 '20 at 10:20
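
One possible way to get a single output row per article, as asked in the follow-up comments, is to collect the matches per description while streaming the place names, and write the rows only at the end. This is a minimal sketch and not part of the accepted answer: the function names `find_places` and `write_places` are hypothetical, and the column names `row` and `desc` are assumed from the code in the question.

```python
import csv
import re
from collections import defaultdict


def find_places(descriptions, cities):
    """Map each description index to the set of place names found in it.

    `descriptions` is a list of strings kept in memory; `cities` is any
    iterable of place names (for example an open file), read one at a time.
    """
    found = defaultdict(set)
    for city in cities:
        name = city.strip()
        if not name:
            continue  # an empty name would match at every word boundary
        pattern = re.compile(r'\b{}\b'.format(re.escape(name)), re.IGNORECASE)
        for index, text in enumerate(descriptions):
            if pattern.search(text):
                found[index].add(name)
    return found


def write_places(rows, cities, output):
    """Write one 'title|description|place' line per row, joining all places found."""
    found = find_places([row['desc'] for row in rows], cities)
    writer = csv.DictWriter(output, delimiter='|',
                            fieldnames=['title', 'description', 'place'])
    writer.writeheader()
    for index, row in enumerate(rows):
        # Fall back to " - " when no place name was found, as in the question
        places = ' , '.join(sorted(found[index])) or ' - '
        writer.writerow({'title': row['row'], 'description': row['desc'], 'place': places})
```

With the files from the question, `rows` could be `list(csv.DictReader(open('description.csv')))` and `cities` the open `country.txt` file. Two caveats: this still performs one regex search per (place name, description) pair, so it reduces memory use rather than total work; and because each name is searched separately, a shorter name contained in a longer one is reported alongside it.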
