
I am using the following code with a dictionary file, Dictionary.txt, and a search text file, SearchText.csv. For each row of the search file I use regex to find the matching keywords, store them, and count them.

The problem is that some dictionary files contain thousands or hundreds of thousands of keywords, and processing takes far too long. I ran the code on one dictionary with 300,000 keywords, and after an hour it hadn't written a single row.

So, what should I do to reduce the running time of this process?

import csv
import time
import re
allCities = open('Dictionary.txt', encoding="utf8").readlines()
timestr = time.strftime("%Y-%m-%d-(%H-%M-%S)")
with open('SearchText.csv') as descriptions, open('Result---' + str(timestr) + '.csv', 'w', newline='') as output:
    descriptions_reader = csv.DictReader(descriptions)
    fieldnames = ['Sr_Num', 'Search', 'matched Keywords', 'Total matches']
    output_writer = csv.DictWriter(output, delimiter='|', fieldnames=fieldnames)
    output_writer.writeheader()
    line=0
    for eachRow in descriptions_reader:
        matches = 0
        Sr_Num = eachRow['Sr_Num']
        description = eachRow['Text']
        citiesFound = set()
        for eachcity in allCities:
            eachcity=eachcity.strip()
            if re.search('\\b'+eachcity+'\\b',description,re.IGNORECASE):
                citiesFound.add(eachcity)
                matches += 1
        if len(citiesFound)==0:
            output_writer.writerow({'Sr_Num': Sr_Num, 'Search': description, 'matched Keywords': " - ", 'Total matches' : matches})

        else:
            output_writer.writerow({'Sr_Num': Sr_Num, 'Search': description, 'matched Keywords': " , ".join(citiesFound), 'Total matches' : matches})
        line += 1
        print(line)

print(" Process Complete ! ")

Here is an example of some rows from Dictionary.txt:

les Escaldes
Andorra la Vella
Umm al Qaywayn
Ras al Khaimah
Khawr Fakkn
Dubai
Dibba Al Fujairah
Dibba Al Hisn
Sharjah
Ar Ruways
itsme
  • `csv.DictWriter` is buffered. The `.writerow()` method does not immediately write the results into the disk file. The fact that the file is empty does not mean that there is no progress. Consider printing something to the console to track the execution. – DYZ Feb 25 '19 at 06:12
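
If you also want to see rows appear in the result file while the script runs, one option (not from the comment above, just the standard file-object method) is to flush the output buffer periodically, for example:

        line += 1
        if line % 1000 == 0:
            output.flush()   # push buffered rows to disk so partial results are visible
            print(line)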

4 Answers


Perform operations that only need to be executed once exactly once:

Instead of

eachcity.strip()

and

re.IGNORECASE

in the loop do

allCities = [city.strip().lower() for city in allCities]

outside of the loop, and convert description to lowercase.

You can remove matches += 1 as well (it's the same as len(citiesFound)), but that will not give much improvement.
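
Putting those changes together, a sketch of the revised loop (same names as the question's code) could look like this:

allCities = [city.strip().lower() for city in allCities]    # strip and lowercase once, up front

for eachRow in descriptions_reader:
    description = eachRow['Text'].lower()                   # lowercase once per row instead of re.IGNORECASE
    citiesFound = set()
    for eachcity in allCities:
        if re.search('\\b' + eachcity + '\\b', description):
            citiesFound.add(eachcity)
    matches = len(citiesFound)                               # replaces the manual matches += 1 counter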

If you do not know where your bottleneck really is, look at the tips here and here. Also, run a profiler on your code to find the real culprit. There is also a SO question regarding profiling which is very useful.
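
For example, Python's built-in profiler can be run from the command line without changing the script (yourscript.py standing in for the actual file name):

python -m cProfile -s cumulative yourscript.py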

Another possibility is to use C or languages which are more optimized for text handling, like awk or sed.

Jan Christoph Terasa

Your biggest time waster is this line:

if re.search('\\b'+eachcity+'\\b',description,re.IGNORECASE):

You are searching the whole description for each eachcity. That's a lot of searching. Consider pre-splitting description into words with nltk.word_tokenize(), converting it to a set, converting allCities into a set as well, and taking a set intersect. Something like this:

citiesFound = set(nltk.word_tokenize(description)) & set(allCities)

No inner loop required.
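
A sketch of how that could look in the question's loop (both sides lowercased to keep the case-insensitive behaviour; note that a plain token intersection only matches single-word keywords such as Dubai, not multi-word ones like les Escaldes):

import nltk   # nltk.download('punkt') may be needed the first time

allCities = {city.strip().lower() for city in allCities}            # build the keyword set once

for eachRow in descriptions_reader:
    description = eachRow['Text']
    words = {w.lower() for w in nltk.word_tokenize(description)}    # tokenize the description once
    citiesFound = words & allCities                                  # one set intersection, no inner loop
    matches = len(citiesFound)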

DYZ

Use databases instead of the file system.

In your case I'd probably use Elasticsearch or MongoDB. Those systems are made for handling large amounts of data.

Loïc

In addition to Jan Christoph Terasa's answer:

1. allCities is a candidate for a set

So:

allCities = set([city.strip().lower() for city in allCities])

and, going further:

2. Use a set of precompiled regular expressions

allCities = set([re.compile('\\b' + city.strip().lower() + '\\b') for city in allCities])
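
A sketch of how those precompiled patterns might then be used in the main loop (a dict keyed by the cleaned keyword is used here instead of a plain set, so the matched keyword text is still available for the output row; the description is lowercased as suggested in the first answer):

cityPatterns = {city.strip().lower(): re.compile('\\b' + city.strip().lower() + '\\b')
                for city in allCities}                     # compile each pattern once

for eachRow in descriptions_reader:
    description = eachRow['Text'].lower()
    citiesFound = {city for city, pattern in cityPatterns.items()
                   if pattern.search(description)}
    matches = len(citiesFound)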
Alex Yu