I'm trying to find exact matches of the place names listed in a country.txt file within a descriptions file, both described below.

Here is an example of country.txt:

Pic de Font Blanca
Roc Mélé
Pic des Langounelles
Pic de les Abelletes
Estany de les Abelletes
Port Vieux de la Coume d’Ose
Port de la Cabanette
Port Dret
Costa de Xurius
Font de la Xona

And here is the descriptions file, description.csv.

The descriptions file is a list of records containing the title and description of each article. What I am trying to do is find the place names from country.txt that appear as exact words in those descriptions.

code.py

import csv
import time
import re

allCities = open('country.txt', encoding="utf8").readlines()
timestr = time.strftime("%Y-%m-%d-(%H-%M-%S)")

with open('description.csv') as descriptions,open('desc_place7---' + str(timestr) + '.csv', 'w', newline='', encoding='utf-8') as output:
    descriptions_reader = csv.DictReader(descriptions)
    fieldnames = ['title', 'description', 'place']
    output_writer = csv.DictWriter(output, delimiter='|', fieldnames=fieldnames)
    output_writer.writeheader()
    line=0
    pattern = r'|'.join(r'\b{}\b'.format(re.escape(city.strip())) for city in sorted(allCities, key=len, reverse=True))

    for eachRow in descriptions_reader:
        title = eachRow['row']
        description = eachRow['desc']
        citiesFound = set()
        found = re.findall(pattern, description, re.IGNORECASE | re.MULTILINE)
        citiesFound.update(found)
        if len(citiesFound)==0:
            output_writer.writerow({'title': title, 'description': description, 'place': " - "})

        else:
            output_writer.writerow({'title': title, 'description': description, 'place': " , ".join(citiesFound)})
        line += 1
        print(line)

expected output: output

But because country.txt (185.94 MB) is such a large file, my code can't run to completion; it makes my laptop freeze. Is there a good way to handle this? I think the pattern line also hurts performance, but I still need a regex to match exact words.

drowsyone

1 Answer

Here is a first implementation for your problem; you will need to adapt and improve it for your specific needs.

First, load all your descriptions into a pandas DataFrame like this:

import pandas as pd
descriptions = pd.read_csv('description.csv')

Then, do not read all the lines of the country file into memory. Instead, read it line by line and look for each name in the descriptions data. Use the following:

import csv
import re
import time

timestr = time.strftime("%Y-%m-%d-(%H-%M-%S)")  # same timestamp format as in the question

with open('country.txt', encoding="utf8") as cities_file, open('desc_place7---' + timestr + '.csv', 'w', newline='', encoding='utf-8') as output:
    fieldnames = ['title', 'description', 'place']
    output_writer = csv.DictWriter(output, delimiter='|', fieldnames=fieldnames)
    output_writer.writeheader()
    line = 0
    # Stream the country file one name at a time instead of loading it all
    for city in cities_file:
        city = city.strip()
        if not city:
            continue  # a blank line would yield an empty pattern that matches at every word boundary
        pattern = re.compile(r'\b{}\b'.format(re.escape(city)), re.IGNORECASE)
        # Note: this writes one output row per (city, description) pair
        for index, row in descriptions.iterrows():
            title = row['row']
            description = row['desc']
            citiesFound = set(pattern.findall(description))
            if len(citiesFound) == 0:
                output_writer.writerow({'title': title, 'description': description, 'place': " - "})
            else:
                output_writer.writerow({'title': title, 'description': description, 'place': " , ".join(citiesFound)})
            line += 1
            print(line)
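
If each description should be written only once, with all of its place names joined, one possible alternative is to turn the lookup around. This is only a sketch and swaps the regex for a different technique: it indexes every run of consecutive words in the descriptions, then streams the place names once and looks each one up in that index. It assumes the descriptions are much smaller than country.txt and fit in memory. The names `tokenize`, `build_ngram_index`, `find_places` and the `max_words` limit are hypothetical, not part of the original code. Matching is on whole words and ignores case, but unlike the combined regex it ignores punctuation between words and also reports a short name that sits inside a longer one.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"\w+")


def tokenize(text):
    """Lowercase the text and split it into a tuple of word tokens."""
    return tuple(TOKEN.findall(text.lower()))


def build_ngram_index(descriptions, max_words=8):
    """Map every run of 1..max_words consecutive words to the rows containing it."""
    index = defaultdict(set)
    for row_id, text in enumerate(descriptions):
        tokens = tokenize(text)
        for start in range(len(tokens)):
            last = min(start + max_words, len(tokens))
            for end in range(start + 1, last + 1):
                index[tokens[start:end]].add(row_id)
    return index


def find_places(descriptions, place_lines, max_words=8):
    """Stream the place names once; return the sorted names found per description."""
    index = build_ngram_index(descriptions, max_words)
    found = defaultdict(set)
    for line in place_lines:
        name = line.strip()
        key = tokenize(name)
        # Skip blank lines and names longer than the indexed word runs
        if not key or len(key) > max_words:
            continue
        for row_id in index.get(key, ()):
            found[row_id].add(name)
    return [sorted(found[i]) for i in range(len(descriptions))]
```

To use it with the files above, pass the description column as a list (for example `descriptions['desc'].tolist()`) and the open country.txt file object as `place_lines`, then write one output row per description with the returned names joined by " , ". Only one line of country.txt is held in memory at a time, and each name costs a single dictionary lookup instead of a scan over every description.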
DavidDr90
  • Hey, thank you. It works, but what if I want to go through `descriptions` only once, instead of iterating over it again for every place name? – drowsyone May 08 '20 at 07:16
  • @drowsyone you can see this answer [how to filter rows in pandas by regex](https://stackoverflow.com/questions/15325182/how-to-filter-rows-in-pandas-by-regex) or [searching matching string pattern from dataframe column in python pandas](https://stackoverflow.com/questions/36740680/searching-matching-string-pattern-from-dataframe-column-in-python-pandas) it will iterate for you – DavidDr90 May 08 '20 at 08:00
  • I don't want to iterate. I mean, how do I join all the place names found in each article row, instead of iterating over all the rows for each place name? – drowsyone May 08 '20 at 10:20
  • Hey, I think your code still can't run for a large file – drowsyone May 09 '20 at 00:12