I have a flat file with terms and sentences. If any term is found in a sentence, I need to append |present to it (term|present). Basically, this is a case-insensitive pattern match that appends |present while retaining the case used in the sentence. What approach would be feasible and fast in Python? I tried this with Oracle regex, but it takes days to process 70k records.
Right now I am using the code below. Is there a better approach? Also, the current approach works fine for 50 records, but df['words'] is empty when I run it on the entire 70k records. I am not sure what the reason could be.
from pandas import DataFrame
df = {'term': ['Ford', 'EXpensive', 'TOYOTA', 'Mercedes Benz', 'electric', 'cars'],
      'sentence': ['Ford is less expensive than Mercedes Benz.', 'toyota, hyundai mileage is good compared to ford', 'tesla is an electric-car', 'toyota too has electric cars', 'CARS', 'CArs are expensive.']
      }
import re
df = DataFrame(df,columns= ['term','sentence'])
# Escape each term so that any regex metacharacters in it are matched literally
pattern = "|".join(rf"\w*(?<![A-Za-z-;:,/|]){re.escape(i)}\b" for i in df["term"])
df["words"] = df["sentence"].str.findall(pattern, flags=re.IGNORECASE)
def replace_values(row):
    if len(row.words) > 0:
        # Escape the matched words and apply \b to every alternative, not just the first
        pat = r"\b(" + "|".join(map(re.escape, row.words)) + r")\b"
        row.sentence = re.sub(pat, r"\1|present", row.sentence)
    return row
df = df.apply(replace_values, axis=1)
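For comparison, here is a minimal sketch of a single-pass alternative: compile one pattern from the escaped terms and let `re.sub` append `|present` to whatever text it matched, so the casing of the sentence is kept without a separate `findall` step. The helper names `build_term_pattern` and `mark_terms` are hypothetical, and the filtering of blank or non-string terms is only an assumption about what the full 70k-row file might contain. Note also that the boundaries here are plain "not a word character" checks, which is looser than the lookbehind in the code above.

```python
import re

def build_term_pattern(terms):
    """Compile one case-insensitive regex matching any term as a whole word."""
    # Skip non-string and blank entries (an assumption: the full file may
    # contain missing values that are read in as NaN or empty strings).
    cleaned = {t.strip() for t in terms if isinstance(t, str) and t.strip()}
    if not cleaned:
        raise ValueError("no usable terms")
    # Longest first, so a multi-word term is tried before a shorter term
    # that it starts with.
    ordered = sorted(cleaned, key=lambda t: (-len(t), t))
    alternation = "|".join(re.escape(t) for t in ordered)
    # (?<!\w) and (?!\w) play the role of \b, but also behave sensibly for
    # terms that begin or end with a non-word character.
    return re.compile(rf"(?<!\w)(?:{alternation})(?!\w)", re.IGNORECASE)

def mark_terms(sentence, pattern):
    """Append '|present' to each matched term, keeping the sentence's casing."""
    return pattern.sub(lambda m: m.group(0) + "|present", sentence)

terms = ["Ford", "EXpensive", "TOYOTA", "Mercedes Benz", "electric", "cars"]
pattern = build_term_pattern(terms)
print(mark_terms("Ford is less expensive than Mercedes Benz.", pattern))
# Ford|present is less expensive|present than Mercedes Benz|present.
```

With the DataFrame from the question, this could be applied as `df["sentence"].map(lambda s: mark_terms(s, pattern))`, assuming every sentence is a string. Because the pattern is compiled once and each sentence is scanned once, this may be noticeably faster than a row-wise `apply` that builds a new regex per row, though that has not been measured here.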