
I am using the following code, in which I have a dictionary file, Dictionary.txt, and a search text file, SearchText.csv, and I use a regex to find, store, and count the matching keywords.

I have a problem: some of the dictionary files contain hundreds of thousands of keywords, and processing takes too long. I ran the code on one dictionary with 300,000 keywords, and after an hour it hadn't written a single row.

So, what should I do to reduce the running time of this process?

import csv
import time
import re
allCities = open('Dictionary.txt', encoding="utf8").readlines()
timestr = time.strftime("%Y-%m-%d-(%H-%M-%S)")
with open('SearchText.csv') as descriptions,open('Result---' + str(timestr) + '.csv', 'w', newline='') as output:
    descriptions_reader = csv.DictReader(descriptions)
    fieldnames = ['Sr_Num', 'Search', 'matched Keywords', 'Total matches']
    output_writer = csv.DictWriter(output, delimiter='|', fieldnames=fieldnames)
    output_writer.writeheader()
    line=0
    for eachRow in descriptions_reader:
        matches = 0
        Sr_Num = eachRow['Sr_Num']
        description = eachRow['Text']
        citiesFound = set()
        for eachcity in allCities:
            eachcity=eachcity.strip()
            if re.search('\\b'+eachcity+'\\b',description,re.IGNORECASE):
                citiesFound.add(eachcity)
                matches += 1
        if len(citiesFound)==0:
            output_writer.writerow({'Sr_Num': Sr_Num, 'Search': description, 'matched Keywords': " - ", 'Total matches' : matches})

        else:
            output_writer.writerow({'Sr_Num': Sr_Num, 'Search': description, 'matched Keywords': " , ".join(citiesFound), 'Total matches' : matches})
        line += 1
        print(line)

print(" Process Complete ! ")

Here is an example of some rows from Dictionary.txt:

les Escaldes
Andorra la Vella
Umm al Qaywayn
Ras al Khaimah
Khawr Fakkān
Dubai
Dibba Al Fujairah
Dibba Al Hisn
Sharjah
Ar Ruways
  • Can you explain more clearly what you are trying to do? Ideally with an example showing how the input is formatted and what you want the output to look like? – Phydeaux Feb 24 '19 at 18:01
  • @Phydeaux I have some dictionary files, which are .txt files, and a .csv file which has two columns, sr_num and text. From the text column I match all the keywords row by row, store and count the matched keywords, and then write them to a new file row by row. – csit Feb 24 '19 at 18:05
  • can you show an example of how `Dictionary.txt` is formatted? (edited into the question) – Phydeaux Feb 24 '19 at 18:07
  • @Phydeaux I have edited the question. – csit Feb 24 '19 at 18:10
  • One of the first things you should do _before_ attempting to optimize a body of code is to profile it and determine where it's spending most of its time. This will tell you what portions are most important to work on and where you're likely to see the most improvement if you can speed it up. The good news is doing that is relatively easy: See [How can you profile a Python script?](https://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script) – martineau Feb 24 '19 at 19:37

2 Answers


As user martineau comments, it's best to profile code to determine where optimisation will be most effective, and to measure the effects of any attempted optimisations.
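For instance, here is a minimal profiling sketch using the standard library's `cProfile` and `pstats`, with a stand-in function representing the hot path (the real script would be profiled with `python -m cProfile -s cumulative script.py`):

```python
import cProfile
import io
import pstats

def hot_loop():
    # Stand-in for the expensive per-row matching work.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
hot_loop()
profiler.disable()

# Print the most expensive calls, sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(5)
print(stream.getvalue())
```

The report shows call counts and cumulative times per function, which identifies whether the regex calls, the file I/O, or something else dominates.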

In the absence of any profiling data, the best candidate for optimisation is likely to be this inner loop:

for eachcity in allCities:
    eachcity=eachcity.strip()
    if re.search('\\b'+eachcity+'\\b',description,re.IGNORECASE):
        citiesFound.add(eachcity)

Here the code is calling strip on each string in allCities - something which could be done just once, outside the loop, then calling re.search for each city.
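As a first step, the stripping (and the pattern compilation) can be hoisted out of the per-row loop. A minimal sketch, with a hypothetical inline city list standing in for `readlines()` on Dictionary.txt:

```python
import re

# Hypothetical lines as returned by readlines() on Dictionary.txt.
all_cities_raw = ['les Escaldes\n', 'Dubai\n', 'Sharjah\n']

# Strip each name once, outside the row loop, and pre-compile each
# pattern so re does not re-parse the pattern text on every call.
compiled = [(city, re.compile(r'\b' + re.escape(city) + r'\b', re.IGNORECASE))
            for city in (line.strip() for line in all_cities_raw)]

description = 'Flying to Dubai via Sharjah'
found = {city for city, rx in compiled if rx.search(description)}
print(found)  # {'Dubai', 'Sharjah'}
```

This alone removes the repeated `strip` and pattern-parsing work, but still performs one search per city per row; combining the cities into a single pattern, as below, goes further.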

It may be more efficient to combine all the cities into a single regex using the | metacharacter, which denotes alternate matches. For example the pattern

r'foo|bar'

will match 'foo' or 'bar'.

Here's a simple example:

>>> text = 'I am flying from les Escaldes to Dubai via Sharjah on Monday'
>>> cities = ['Sharjah', 'Dubai', 'les Escaldas', 'Ar Ruways']
>>> pattern = r'|'.join(r'\b{}\b'.format(re.escape(city)) for city in cities)
>>> pattern
'\\bSharjah\\b|\\bDubai\\b|\\bles Escaldas\\b|\\bAr Ruways\\b'
>>> matches = re.findall(pattern, text)
>>> print(matches)
['les Escaldas', 'Dubai', 'Sharjah']

Calling re.escape on each city name prevents the search pattern being altered if one of the city names contains a character which is also a regular expression metacharacter.
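For example, a name containing a dot would otherwise match unintended text (a contrived illustration; `Washington D.C.` is a hypothetical dictionary entry):

```python
import re

city = 'Washington D.C.'          # hypothetical entry containing '.'
text = 'Meeting in Washington DxCx today'

# Unescaped, each '.' matches any character, so 'DxCx' matches.
assert re.search(r'\b' + city + r'\b', text) is not None

# Escaped, the dots are literal and no false match occurs.
assert re.search(r'\b' + re.escape(city) + r'\b', text) is None
```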

Applying this technique to the code in the question:

# We only need to create the regex pattern once,
# so do it outside the loop.
pattern = r'|'.join(r'\b{}\b'.format(re.escape(city.strip())) for city in allCities)

for eachRow in descriptions_reader:
    Sr_Num = eachRow['Sr_Num']
    description = eachRow['Text']
    citiesFound = set()
    found = re.findall(pattern, description, re.IGNORECASE)
    citiesFound.update(found)
    # findall returns every occurrence, so use the set's size to
    # count each matched city once, as the original code did.
    matches = len(citiesFound)

The behaviour of the | metacharacter is described in the documentation.

REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match.

So if there are potential matches which are substrings of other candidates - like "Dubai" and "Dubai City" - the longer candidate must appear earlier in the pattern, otherwise the engine will find the shorter and return it as the match.
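This is easy to see with a contrived sentence:

```python
import re

text = 'I am flying to Dubai City on Monday'

# Shorter alternative first: the engine accepts 'Dubai' and stops.
short_first = re.findall(r'\bDubai\b|\bDubai City\b', text)
print(short_first)    # ['Dubai']

# Longer alternative first: the full name is matched.
long_first = re.findall(r'\bDubai City\b|\bDubai\b', text)
print(long_first)     # ['Dubai City']
```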

To prevent this, sort allCities in descending order of length before creating the pattern:

allCities.sort(key=len, reverse=True)

or use sorted if the order of allCities must be preserved:

pattern = r'|'.join(r'\b{}\b'.format(re.escape(city.strip())) for
                    city in sorted(allCities, key=len, reverse=True))
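Putting the pieces together, a minimal end-to-end sketch (with hypothetical inline data standing in for Dictionary.txt, and `re.compile` used so the combined pattern is parsed only once):

```python
import re

# Hypothetical stand-in for the stripped lines of Dictionary.txt.
all_cities = ['Dubai', 'Dubai City', 'Sharjah', 'les Escaldes']

# Longest names first, escaped, joined into one alternation.
pattern = re.compile(
    '|'.join(r'\b{}\b'.format(re.escape(city))
             for city in sorted(all_cities, key=len, reverse=True)),
    re.IGNORECASE)

description = 'Flying from les Escaldes to Dubai City via Sharjah'
found = pattern.findall(description)
print(found)   # ['les Escaldes', 'Dubai City', 'Sharjah']
```

Note that `'Dubai City'` is matched in full rather than stopping at `'Dubai'`, because the longer name appears earlier in the alternation.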
snakecharmerb
  • Kindly, can you explain a little more how the above example works? E.g., with the above text, if the text were `I am flying from les Escaldas to Dubai City via Sharjah on Monday`, what would match, and what is the logic of the matching criteria? I think you understand what I mean. And yes, in cities I also have `"Dubai", "Dubai City"`. – csit Mar 09 '19 at 08:20
  • I mean, please explain in more detail the logic by which the matching keyword is selected in the text. – csit Mar 09 '19 at 08:22
  • @csit I've added information about the order of matching to the answer – snakecharmerb Mar 09 '19 at 08:48
  • Okay, thank you so much. One little and simple question: does this save the same matched keyword once or more than once in `citiesFound = set()`? I mean, e.g., in `i love dubai and want to live in dubai` the word dubai occurs 2 times, so does this code save dubai 2 times or 1 time? How many times does it count the matched word? – csit Mar 09 '19 at 09:10
  • I want to count this match one time per row, because the matched word is already counted in that row. – csit Mar 09 '19 at 09:15
  • A `set` only contains each of its elements once. If you want to keep duplicate matches, add them to a list instead. – snakecharmerb Mar 09 '19 at 09:17
  • Okay, thank you. No, I want it just one time, so thank you so much for your help. – csit Mar 09 '19 at 09:19
  • You should use `re.escape` on the strings before building an alternation of them. Also, the strings should be sorted longest to shortest (`key=lambda s: -len(s)`) so that the longest string is chosen when some strings are prefixes of others. This is because `re` picks the left-most matching branch of an alternation when there are multiple possible matches. – Dan D. Mar 09 '19 at 10:17
  • @DanD. good point on `re.escape`, I've amended the answer to refer to this. For sorting, the approach in the answer `sorted(L, key=len, reverse=True)` seems to be 4x faster on my machine than `sorted(L, key=lambda s: -len(s))` - or am I missing something? – snakecharmerb Mar 09 '19 at 11:27

You could use a set of city names instead of a list, and split the description on spaces to isolate words. This may be faster than a regular expression.

For example:

...
allCities = open('Dictionary.txt', encoding="utf8").readlines()
citySet = set(city.lower().strip() for city in allCities)
...
    ...
    citiesFound = set(word for word in description.split(" ") if word.lower() in citySet)
    ...
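A runnable sketch of this approach, with hypothetical inline data, also makes its main limitations visible: splitting on spaces cannot match multi-word names like `les Escaldes`, and punctuation attached to a word defeats the set lookup.

```python
# Hypothetical lines as returned by readlines() on Dictionary.txt.
all_cities_raw = ['Dubai\n', 'Sharjah\n', 'les Escaldes\n']
city_set = set(city.lower().strip() for city in all_cities_raw)

description = 'Flying from les Escaldes to Dubai via Sharjah.'
cities_found = set(word for word in description.split(" ")
                   if word.lower() in city_set)

# 'Sharjah.' carries a trailing full stop and 'les Escaldes' spans
# two words, so neither is found -- only 'Dubai' survives the lookup.
print(cities_found)   # {'Dubai'}
```

So while set membership tests are very fast, this approach trades accuracy for speed compared with the regex alternation above.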
Alain T.