Match different patterns, also using an abbreviation dict

Question

I have a problem. I have a free-text column, and a regex should recognize whether an element occurs in the text. Unfortunately, some elements have abbreviations, so I created an abbreviation dict. Is there a way to also loop through the dict, so that if the element is in the dict, its abbreviation (e.g. ca for cat) also counts as a match?

Dataframe

   customerId                text element  code
0           1  Something with Cat     cat     0
1           3  That is a huge dog     dog     1
2           3         Hello agian   mouse     2
3           3        This is a ca     cat     0

Code

import pandas as pd
import copy
import re

d = {
    "customerId": [1, 3, 3, 3],
    "text": ["Something with Cat", "That is a huge dog", "Hello agian", "This is a ca"],
    "element": ["cat", "dog", "mouse", "cat"],
    "code": [9, 8, 7, 9],
}
df = pd.DataFrame(data=d)
df['code'] = df['element'].astype('category').cat.codes
print(df)

abbreviation = {
    "cat": {
        "abbrev1": "ca",
    },
} 


%%time

elements = df['element'].unique()
def f(x):
    match = 999
    for element in elements:
        elements2 = [element]
        y = bool(re.search(element, x['text'], re.IGNORECASE))
        #^ here
        if(y):
            #print(forwarder)
            match = x['code']
            #match = True
            break
    x['test'] = match
    return x
df['test'] = None
df = df.apply(lambda x: f(x), axis = 1)

What I have

   customerId                text element  code  test
0           1  Something with Cat     cat     0     0
1           3  That is a huge dog     dog     1     1
2           3         Hello agian   mouse     2   999
3           3        This is a ca     cat     0   999

What I want

   customerId                text element  code  test
0           1  Something with Cat     cat     0     0
1           3  That is a huge dog     dog     1     1
2           3         Hello agian   mouse     2   999
3           3        This is a ca     cat     0     0

To be clear, if cat and dog were swapped in `element`, would there be no match in the first two rows, or wouldn't it change anything? — mozway, Jul 18 '22 at 07:58
@mozway If I understand your question correctly, it wouldn't change anything. — Test, Jul 18 '22 at 08:06

Dani Mesejo · Answer 1 · 2022-07-18T09:03:33.937

TL;DR

One approach:

import re
import numpy as np

# create an inverse lookup dictionary for the abbreviations
lookup = {v.lower(): k for k, d in abbreviation.items() for _, v in d.items()}
elements = df['element'].unique()

# replace the abbreviations with the full words
normal = df["text"].str.replace(fr"\b({'|'.join(lookup.keys())})\b", lambda x: lookup[x.group().lower()], regex=True, flags=re.IGNORECASE)

# then find the words in text with the full words
df["test"] = np.where(normal.str.contains(fr"\b({'|'.join(elements)})\b", flags=re.IGNORECASE), df["code"], 999)

print(df)

Output

   customerId                text element  code  test
0           1  Something with Cat     cat     0     0
1           3  That is a huge dog     dog     1     1
2           3         Hello agian   mouse     2   999
3           3        This is a ca     cat     0     0

Full Explanation

The first step is to create a lookup dictionary for the abbreviations:

lookup = {v.lower(): k for k, d in abbreviation.items() for _, v in d.items()}

For the current example, lookup holds the following value:

{'ca': 'cat'}
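To see how the inversion behaves when a word has more than one abbreviation, here is a minimal sketch. The extra entries ("Ct", "dg") are made up for illustration and are not part of the question's data.

```python
# Hypothetical abbreviation dict with several entries (illustrative only)
abbreviation = {
    "cat": {"abbrev1": "ca", "abbrev2": "Ct"},
    "dog": {"abbrev1": "dg"},
}

# Invert the nested dict: lowercased abbreviation -> full word
lookup = {v.lower(): k for k, d in abbreviation.items() for v in d.values()}

print(lookup)  # {'ca': 'cat', 'ct': 'cat', 'dg': 'dog'}
```

Lowercasing the keys is what lets the later replacement step look up a match regardless of how it was capitalized in the text.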

The second step is to use str.replace to replace the abbreviations with the full words:

normal = df["text"].str.replace(fr"\b({'|'.join(lookup.keys())})\b", lambda x: lookup[x.group().lower()], regex=True, flags=re.IGNORECASE)

The variable normal holds the value:

0    Something with Cat
1    That is a huge dog
2           Hello agian
3         This is a cat
Name: text, dtype: object

Note that the second parameter of str.replace is a callable. A description of this functionality can be found in the documentation (emphasis mine):

repl : str or callable
    Replacement string or a callable. The callable is passed the regex match object and must return a replacement string to be used. See re.sub().
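The callable replacement works the same way in plain re.sub, which can be handy for testing the pattern outside pandas. The following is a rough sketch; the names pattern and expand are illustrative, and re.escape is added here as a precaution in case an abbreviation contains regex metacharacters.

```python
import re

lookup = {"ca": "cat"}

# Word-bounded alternation of all abbreviations, escaped for safety
pattern = re.compile(
    r"\b(?:" + "|".join(map(re.escape, lookup)) + r")\b",
    flags=re.IGNORECASE,
)

def expand(match):
    # The callable receives the match object and returns the replacement string
    return lookup[match.group().lower()]

print(pattern.sub(expand, "This is a CA"))        # This is a cat
print(pattern.sub(expand, "Something with Cat"))  # Something with Cat
```

The second call leaves the text unchanged because the word boundary \b prevents "ca" from matching inside "Cat".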

Finally, use str.contains to create a boolean mask and pass it to np.where:

df["test"] = np.where(normal.str.contains(fr"\b({'|'.join(elements)})\b", flags=re.IGNORECASE), df["code"], 999)

In other words, if there is a match, use the corresponding value in df["code"]; otherwise use 999 to signal that no match was found.
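For checking individual rows by hand, the same match-or-default logic can be written without pandas. This is only a sketch under the assumption that the text has already been normalized; the function name code_or_default is hypothetical.

```python
import re

def code_or_default(text, elements, code, default=999):
    """Return `code` if any element occurs as a whole word in `text`, else `default`."""
    pattern = r"\b(?:" + "|".join(map(re.escape, elements)) + r")\b"
    return code if re.search(pattern, text, flags=re.IGNORECASE) else default

elements = ["cat", "dog", "mouse"]

# (normalized text, code) pairs taken from the example data
rows = [("Something with Cat", 0), ("Hello agian", 2), ("This is a cat", 0)]

print([code_or_default(text, elements, code) for text, code in rows])  # [0, 999, 0]
```

As in the vectorized version, a row counts as matched when any element appears in its text, not only the element listed in that row.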

Note on performance

If the number of abbreviations is large and performance is an issue, you could use trrex:

import trrex as tx

# replace the abbreviations with the full words
normal = df["text"].str.replace(tx.make(lookup.keys()), lambda x: lookup[x.group().lower()], regex=True, flags=re.IGNORECASE)

# then find the words in text with the full words
df["test"] = np.where(normal.str.contains(tx.make(elements), flags=re.IGNORECASE), df["code"], 999)

Note that you need to install the library:

pip install trrex

See these answers ([1] and [2]) for a detailed discussion on performance gains.

DISCLAIMER: I'm the author of trrex
