I am using the following code with a dictionary file, Dictionary.txt, and a search text file, SearchText.csv. For each row of the search text, I use regex to find the dictionary keywords that match, store them, and count the matches.
The problem is that some of the dictionary files contain thousands or hundreds of thousands of keywords, and processing takes far too long. When I ran the code against a dictionary with 300,000 keywords, it had not written a single output row after an hour.
So, what should I do to reduce the running time of this process?
import csv
import re
import time

# Load every keyword once, up front
with open('Dictionary.txt', encoding='utf8') as dictionary:
    allCities = dictionary.readlines()

timestr = time.strftime("%Y-%m-%d-(%H-%M-%S)")

with open('SearchText.csv') as descriptions, \
        open('Result---' + timestr + '.csv', 'w', newline='') as output:
    descriptions_reader = csv.DictReader(descriptions)
    fieldnames = ['Sr_Num', 'Search', 'matched Keywords', 'Total matches']
    output_writer = csv.DictWriter(output, delimiter='|', fieldnames=fieldnames)
    output_writer.writeheader()
    line = 0
    for eachRow in descriptions_reader:
        matches = 0
        Sr_Num = eachRow['Sr_Num']
        description = eachRow['Text']
        citiesFound = set()
        # Test every keyword against the current row
        for eachcity in allCities:
            eachcity = eachcity.strip()
            if re.search(r'\b' + eachcity + r'\b', description, re.IGNORECASE):
                citiesFound.add(eachcity)
                matches += 1
        if len(citiesFound) == 0:
            output_writer.writerow({'Sr_Num': Sr_Num, 'Search': description,
                                    'matched Keywords': " - ", 'Total matches': matches})
        else:
            output_writer.writerow({'Sr_Num': Sr_Num, 'Search': description,
                                    'matched Keywords': " , ".join(citiesFound), 'Total matches': matches})
        line += 1
        print(line)

print(" Process Complete ! ")
Here are some example rows from Dictionary.txt:
les Escaldes
Andorra la Vella
Umm al Qaywayn
Ras al Khaimah
Khawr Fakkān
Dubai
Dibba Al Fujairah
Dibba Al Hisn
Sharjah
Ar Ruways
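Since entries like les Escaldes contain spaces, I don't think splitting each description into words and checking them against a set would work. Would building a single combined pattern that matches the whole dictionary in one pass be the right direction? A rough sketch of what I mean (I don't know whether Python's re module can cope with an alternation that has 300,000 branches):

import re

with open('Dictionary.txt', encoding='utf8') as dictionary:
    cities = [line.strip() for line in dictionary if line.strip()]

# One pattern with every keyword as an alternative, longest first so that
# e.g. "Dibba Al Fujairah" wins over a shorter overlapping entry.
cities.sort(key=len, reverse=True)
combined = re.compile(r'\b(?:' + '|'.join(map(re.escape, cities)) + r')\b',
                      re.IGNORECASE)

# Each description would then be scanned once instead of once per keyword:
# citiesFound = {m.group(0) for m in combined.finditer(description)}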