2

I have a flat file with terms and sentences. If any term is found in the sentence, I need to append |present (term|present). Basically, pattern match (case insensitive) and append |present. Also, we need to retain the same case as in the sentence. What approach would be feasible and faster in Python. I tried this using Oracle regex, this takes days to process 70k records.

Right now I am using the below code. Is there a better approach. And also with the current approach, it works fine for the 50 records but the df['words'] is empty when run for the entire 70k records. Not sure what the reason could be.

from pandas import DataFrame

df = {'term': ['Ford', 'EXpensive', 'TOYOTA', 'Mercedes Benz', 'electric', 'cars'],
        'sentence': ['Ford is less expensive than Mercedes Benz.' ,'toyota, hyundai mileage is good compared to ford','tesla is an electric-car','toyota too has electric cars','CARS','CArs are expensive.']
        }
from pandas import DataFrame
import re
df = DataFrame(df,columns= ['term','sentence'])

pattern = "|".join(f"\w*(?<![A-Za-z-;:,/|]){i}\\b" for i in df["term"])

df["words"]= df['sentence'].str.findall(pattern, flags=re.IGNORECASE)

def replace_values(row):
    if len(row.words)>0:
        pat = r"(\b"+"|".join(row.words) +r")(\b)"
        row.sentence = re.sub(pat, "\\1|present\\2", row.sentence)
    return row

df = df.apply(replace_values, axis=1)


ABY
  • 393
  • 2
  • 11

1 Answers1

2

Your pattern is very inefficient since it has many unanchored alternatives starting with an identical pattern that can backtrack a lot. Besides, the word boundaries that you want to use you also need to sort the terms by length in descending order to find the longest terms in case they overlap (like, say, Merceded and Mercedes Benz).

So, use

pattern = r'(?i)\b(?:{})\b'.format('|'.join(sorted(df["term"],key=len,reverse=True)))

Or, if you have spcial chars in the terms,

pattern = r'(?i)(?<!\w)(?:{})(?!\w)'.format('|'.join(map(re.escape, sorted(df["term"],key=len,reverse=True))))

and then

df["words"]= df["sentence"].str.findall(pattern)
df["sentence"].replace(pattern, r"\g<0>|present", inplace=True, regex=True)

The pattern will look like (?i)\b(?:Mercedes Benz|EXpensive|electric|TOYOTA|Ford|cars)\b, it will match - in a case insensitive way due to (?i) - whole words Mercedes Benz, EXpensive, electric, TOYOTA, Ford, cars and in df["sentence"].str.findall(pattern), it will find all non-overlapping occurrences of the pattern match and in df["sentence"].replace(pattern, r"\g<0>|present", inplace=True, regex=True), all matches will be replaced with themselves (\g<0> is a backreference to the whole match) + |present appended to them.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • Thank you. Let me try. Is there a way we can exclude replacement beside a couple of special characters? Say, we had a term way and sentence had four-way. The replacement should not happen as it's substring. or if the term is part of a URL https://ford.com. This just messes up the existing URLs. – ABY Oct 16 '19 at 23:16
  • @ABY That is a matter of what you consider a word boundary. If you want it to be any non-word char except a `-` on the left, just add `(?<!-)` after the first `\b`. If you need whitespace boundaries, use `r'(?i)(?<!\S)(?:{})(?!\S)'.format('|'.join(map(re.escape, sorted(df["term"],key=len,reverse=True))))`. Note that if you have thousands of terms, you need to create a [regex trie](https://stackoverflow.com/a/42789508/3832970). – Wiktor Stribiżew Oct 16 '19 at 23:18
  • 1
    :) Your explanation was really helpful. Thanks again. I have posted another question (58428167) which is a slight modification from this. Appreciate any help, if possible. – ABY Oct 17 '19 at 08:32