
I am using the following code with a dictionary file, Dictionary.txt, and a search text file, SearchText.csv. For each row of the search file I use regex to find the matching keywords, store them, and count them.

The problem is that some dictionary files contain thousands or hundreds of thousands of keywords, and processing takes far too long. I ran the code on one dictionary with 300,000 keywords, and after an hour it hadn't written a single row.

So, what should I do to reduce the running time of this process?

import csv
import time
import re
allCities = open('Dictionary.txt', encoding="utf8").readlines()
timestr = time.strftime("%Y-%m-%d-(%H-%M-%S)")
with open('SearchText.csv') as descriptions, open('Result---' + str(timestr) + '.csv', 'w', newline='') as output:
    descriptions_reader = csv.DictReader(descriptions)
    fieldnames = ['Sr_Num', 'Search', 'matched Keywords', 'Total matches']
    output_writer = csv.DictWriter(output, delimiter='|', fieldnames=fieldnames)
    output_writer.writeheader()
    line=0
    for eachRow in descriptions_reader:
        matches = 0
        Sr_Num = eachRow['Sr_Num']
        description = eachRow['Text']
        citiesFound = set()
        for eachcity in allCities:
            eachcity=eachcity.strip()
            if re.search('\\b'+eachcity+'\\b',description,re.IGNORECASE):
                citiesFound.add(eachcity)
                matches += 1
        if len(citiesFound)==0:
            output_writer.writerow({'Sr_Num': Sr_Num, 'Search': description, 'matched Keywords': " - ", 'Total matches' : matches})

        else:
            output_writer.writerow({'Sr_Num': Sr_Num, 'Search': description, 'matched Keywords': " , ".join(citiesFound), 'Total matches' : matches})
        line += 1
        print(line)

print(" Process Complete ! ")

Here is an example of some rows from Dictionary.txt:

les Escaldes
Andorra la Vella
Umm al Qaywayn
Ras al Khaimah
Khawr Fakkn
Dubai
Dibba Al Fujairah
Dibba Al Hisn
Sharjah
Ar Ruways
itsme
  • `csv.DictWriter` is buffered. The `.writerow()` method does not immediately write the results into the disk file. The fact that the file is empty does not mean that there is no progress. Consider printing something to the console to track the execution. – DYZ Feb 25 '19 at 06:12
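
If you also want to see rows appear in the result file while the script runs, one option (not from the comment above, just the standard file-object method) is to flush the output buffer periodically, for example:

        line += 1
        if line % 1000 == 0:
            output.flush()   # push buffered rows to disk so partial results are visible
            print(line)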

4 Answers


Perform operations that only need to be executed once exactly once:

Instead of

eachcity.strip()

and

re.IGNORECASE

in the loop do

allCities = [city.strip().lower() for city in allCities]

outside of the loop, and convert description to lowercase.

You can remove matches += 1 as well (it's the same as len(citiesFound)), but that will not give much improvement.
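
Putting those changes together, a sketch of the revised loop (same names as the question's code) could look like this:

allCities = [city.strip().lower() for city in allCities]    # strip and lowercase once, up front

for eachRow in descriptions_reader:
    description = eachRow['Text'].lower()                   # lowercase once per row instead of re.IGNORECASE
    citiesFound = set()
    for eachcity in allCities:
        if re.search('\\b' + eachcity + '\\b', description):
            citiesFound.add(eachcity)
    matches = len(citiesFound)                               # replaces the manual matches += 1 counter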

If you do not know where your bottleneck really is, look at the tips here and here. Also, run a profiler on your code to find the real culprit. There is also a SO question regarding profiling which is very useful.
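
For example, Python's built-in profiler can be run from the command line without changing the script (yourscript.py standing in for the actual file name):

python -m cProfile -s cumulative yourscript.py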

Another possibility is to use C or languages which are more optimized for text handling, like awk or sed.

Jan Christoph Terasa

Your biggest time waster is this line:

if re.search('\\b'+eachcity+'\\b',description,re.IGNORECASE):

You are searching the whole description for each eachcity. That's a lot of searching. Consider pre-splitting description into words with nltk.word_tokenize(), converting it to a set, converting allCities into a set as well, and taking a set intersect. Something like this:

citiesFound = set(nltk.word_tokenize(description)) & set(allCities)

No inner loop required.
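
A sketch of how that could look in the question's loop (both sides lowercased to keep the case-insensitive behaviour; note that a plain token intersection only matches single-word keywords such as Dubai, not multi-word ones like les Escaldes):

import nltk   # nltk.download('punkt') may be needed the first time

allCities = {city.strip().lower() for city in allCities}            # build the keyword set once

for eachRow in descriptions_reader:
    description = eachRow['Text']
    words = {w.lower() for w in nltk.word_tokenize(description)}    # tokenize the description once
    citiesFound = words & allCities                                  # one set intersection, no inner loop
    matches = len(citiesFound)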

DYZ

Use databases instead of the file system.

In your case I'd probably use Elasticsearch or MongoDB. Those systems are made for handling large amounts of data.

Loïc

In addition to Jan Christoph Terasa's answer:

1. allCities is a candidate for a set

So:

allCities = set([city.strip().lower() for city in allCities])

and, going further:

2. Use a set of precompiled regular expressions

allCities = set([re.compile('\\b' + city.strip().lower() + '\\b') for city in allCities])
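
A sketch of how those precompiled patterns might then be used in the main loop (a dict keyed by the cleaned keyword is used here instead of a plain set, so the matched keyword text is still available for the output row; the description is lowercased as suggested in the first answer):

cityPatterns = {city.strip().lower(): re.compile('\\b' + city.strip().lower() + '\\b')
                for city in allCities}                     # compile each pattern once

for eachRow in descriptions_reader:
    description = eachRow['Text'].lower()
    citiesFound = {city for city, pattern in cityPatterns.items()
                   if pattern.search(description)}
    matches = len(citiesFound)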
Alex Yu