I have recently started using the re package in order to clean up transaction descriptions.
Example of original transaction descriptions:
['bread','payment to facebook.com', 'milk', 'savings', 'amazon.com $xx ased lux', 'holiday_amazon']
For a list of expressions I would like to replace the current description with a better one, e.g. if one of the list entries contains 'facebook' or 'amazon' preceded by a space (or at the beginning of the string), I want to replace the entire list entry by the word 'facebook' or 'amazon' respectively, i.e.:
['bread', 'facebook', 'milk', 'savings', 'amazon', 'holiday_amazon']
As I only want to pick it up if the word facebook is preceded by a space or if it is at the beginning of a word, I have created regex that represent this, e.g. (^|\s)facebook. Note that this is only an example, in reality I want to filter out more complex expressions as well.
In total I have a dataframe with 90 such expressions that I want to replace.
My current code (with minimum workable example) is:
import pandas as pd
import re
def specialCases(list_of_narratives, replacement_dataframe):
# Create output array
new_narratives = []
special_cases_identifiers = replacement_dataframe["REGEX_TEST"]
# For each string element of the list
for memo in list_of_narratives:
index_count = 0
found_count = 0
for i in special_cases_identifiers:
regex = re.compile(i)
if re.search(regex, memo.lower()) is not None:
new_narratives.append(replacement_dataframe["NARRATIVE_TO"].values[index_count].lower())
index_count += 1
found_count += 1
break
else:
index_count += 1
if found_count == 0:
new_narratives.append(memo.lower())
return new_narratives
# Minimum example creation
list_of_narratives = ['bread','payment to facebook.com', 'milk', 'savings', 'amazon.com $xx ased lux', 'holiday_amazon']
list_of_regex_expressions = ['(^|\s)facebook', '(^|\s)amazon']
list_of_regex_replacements = ['facebook', 'amazon']
replacement_dataframe = pd.DataFrame({'REGEX_TEST': list_of_regex_expressions, 'NARRATIVE_TO': list_of_regex_replacements})
# run code
new_narratives = specialCases(list_of_narratives, replacement_dataframe)
However, with over 1 million list entries and 90 different regex expressions to be replaced (i.e. len(list_of_regex_expressions) is 90) this is extremely slow, presumably due to the double for loop.
Could someone help me improve the performance of this code?