Python replace strings using regex on large dataset

Question

I have recently started using the re package in order to clean up transaction descriptions.

Example of original transaction descriptions:

['bread','payment to facebook.com', 'milk', 'savings', 'amazon.com $xx ased lux', 'holiday_amazon']

For each expression in a list, I would like to replace any matching description with a cleaner one. For example, if an entry contains 'facebook' or 'amazon' preceded by a space (or at the beginning of the string), I want to replace the entire entry with the word 'facebook' or 'amazon' respectively, i.e.:

['bread', 'facebook', 'milk', 'savings', 'amazon', 'holiday_amazon']

As I only want a match if the word facebook is preceded by a space or sits at the beginning of the string, I have created regexes that represent this, e.g. (^|\s)facebook. Note that this is only an example; in reality I want to match more complex expressions as well.
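
To make the intended behaviour concrete, here is a small check of what (^|\s)facebook does and does not match. The sample strings other than 'payment to facebook.com' are made up for illustration.

```python
import re

# Raw string so that \s reaches the regex engine unchanged.
pattern = re.compile(r'(^|\s)facebook')

# Illustrative inputs: only the first one comes from the example list above.
samples = ['payment to facebook.com', 'facebook ads', 'holiday_facebook', 'myfacebook']

# True when 'facebook' is at the start of the string or follows whitespace.
matches = {text: bool(pattern.search(text)) for text in samples}

for text, hit in matches.items():
    print(f'{text!r}: {hit}')
```

The first two strings match; the last two do not, because 'facebook' is preceded there by a non-whitespace character.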

In total I have a dataframe with 90 such expressions that I want to replace.

My current code (with a minimal working example) is:

import pandas as pd
import re

def specialCases(list_of_narratives, replacement_dataframe):
    # Create output array
    new_narratives = []
    special_cases_identifiers = replacement_dataframe["REGEX_TEST"]
    # For each string element of the list
    for memo in list_of_narratives:
        index_count = 0
        found_count = 0
        for i in special_cases_identifiers:
            regex = re.compile(i)
            if re.search(regex, memo.lower()) is not None:
                new_narratives.append(replacement_dataframe["NARRATIVE_TO"].values[index_count].lower())
                index_count += 1
                found_count += 1
                break
            else:
                index_count += 1
        if found_count == 0:
            new_narratives.append(memo.lower())
    return new_narratives

# Minimum example creation
list_of_narratives = ['bread','payment to facebook.com', 'milk', 'savings', 'amazon.com $xx ased lux', 'holiday_amazon']
list_of_regex_expressions = [r'(^|\s)facebook', r'(^|\s)amazon']
list_of_regex_replacements = ['facebook', 'amazon']
replacement_dataframe = pd.DataFrame({'REGEX_TEST': list_of_regex_expressions, 'NARRATIVE_TO': list_of_regex_replacements})

# run code
new_narratives = specialCases(list_of_narratives, replacement_dataframe)

However, with over 1 million list entries and 90 different regex expressions to be replaced (i.e. len(list_of_regex_expressions) is 90) this is extremely slow, presumably due to the double for loop.

Could someone help me improve the performance of this code?

Build the regex patterns as `'f(?<!\Sf)acebook'`. Or, use [Speed up millions of regex replacements in Python 3](https://stackoverflow.com/questions/42742810) solutions. — Wiktor Stribiżew, Aug 22 '18 at 18:20
Thanks for the link Wiktor, I had already seen and thought of that. However, in that example the aim is to remove words (i.e. replace with ""), whereas in my case I want to replace an expression with a word, which depends on the expression that matches. — Exclusive92, Aug 22 '18 at 23:11
One improvement is to use `list_of_regex_expressions = [re.compile('(^|\s)facebook'), re.compile('(^|\s)amazon')]` and then remove `regex = re.compile(i)` in the method and use `if re.search(i, memo.lower())` — Wiktor Stribiżew, Aug 23 '18 at 08:04
Thanks Wiktor, this already helped a bit, but I'd still be keen to take a faster solution onboard if it exists — Exclusive92, Aug 23 '18 at 13:47
What kind of solution are you waiting for? What about my top comment suggestion for a pattern? It is a better idea than using an alternation at the start of the pattern. — Wiktor Stribiżew, Aug 23 '18 at 13:49
I assumed that the best way of increasing speed would be to get rid of the double loop, or by somehow vectorising the approach. In terms of the first comment suggestion for a pattern, I struggle to understand what each of these characters implies given that I am relatively new to Regex — Exclusive92, Aug 23 '18 at 15:24
Still, try the solution based on that thread I refer to in the top comment - https://ideone.com/oe68V1 — Wiktor Stribiżew, Aug 24 '18 at 14:37
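
One way to remove the inner loop, along the lines of the combined-pattern idea in the linked thread, is to join all expressions into a single compiled alternation and wrap each in a named group, so that the match itself reports which expression fired. The sketch below is only an illustration, not a tested solution for the real 90 expressions: the names `build_combined`, `special_cases`, and the group names `r0`, `r1`, ... are invented here, and it assumes the individual expressions contain no named groups or numbered backreferences of their own. Note also that it picks the leftmost match in the string rather than the first expression in list order that matches anywhere, which can differ from the original loop when several expressions match the same entry.

```python
import re

def build_combined(patterns):
    # Wrap every pattern in its own named group (r0, r1, ...) and join them
    # with '|', so a single search covers all expressions at once.
    parts = [f'(?P<r{i}>{p})' for i, p in enumerate(patterns)]
    return re.compile('|'.join(parts))

def special_cases(narratives, patterns, replacements):
    combined = build_combined(patterns)  # compiled once, not per entry
    lookup = {f'r{i}': rep.lower() for i, rep in enumerate(replacements)}
    out = []
    for memo in narratives:
        memo = memo.lower()
        m = combined.search(memo)
        # m.lastgroup is the name of the outer group that matched,
        # which maps back to the replacement for that expression.
        out.append(lookup[m.lastgroup] if m else memo)
    return out

narratives = ['bread', 'payment to facebook.com', 'milk', 'savings',
              'amazon.com $xx ased lux', 'holiday_amazon']
patterns = [r'(^|\s)facebook', r'(^|\s)amazon']
replacements = ['facebook', 'amazon']

result = special_cases(narratives, patterns, replacements)
print(result)
```

On the example data this gives the desired list from the question. Whether it is actually faster on a million entries with 90 expressions would need to be measured, since a large alternation is not free either.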
