Mapping dictionary to partial string match in dataframe

Question

I have the following df

 ID       Code
ABCD      00FQ
JKFA    8LK9|4F5H
QWST    2RLA|R1T5&8LK9

I am trying to map a dictionary (where the values are lists) to the individual elements of Code.

The dictionary looks like so:

d={'00FQ':"['A','B']", '8LK9':"['X']", '4F5H':"['U','Z']", '2RLA':"['H','K']", 'R1T5':"['B','G']" }

I want to map this dict to each element in the Code column separated by |.

The resulting df would look like so:

 ID       Code           Logic
ABCD      00FQ          ['A','B']
JKFA    8LK9|4F5H       ['X'] | ['U','Z']
QWST    2RLA|R1T5&8LK9  ['H','K'] | ['B','G'] & ['X']

BENY · Answer 1 · 2019-11-18T21:15:57.813

If I understand correctly, you can split, replace, and join:

s=df.Code.str.split('|',expand=True).replace(d).stack().groupby(level=0).agg('|'.join)
s
Out[205]: 
0              ['A','B']
1        ['X']|['U','Z']
2    ['H','K']|['B','G']
dtype: object
df['Logic']=s

Update

s=df.Code.str.split('|')
df['Logic']=['|'.join(d.get(y)for y in x) for x in s]
Out[229]: ["['A','B']", "['X']|['U','Z']", "['H','K']|['B','G']"]
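
The join in the update only works when the dictionary values are strings. If they are real lists (which would be consistent with the `TypeError: sequence item 0: expected str instance, list found` reported in the comments), each list has to be rendered as text first. A minimal sketch in plain Python, assuming a hypothetical dictionary `d_lists` and helper names `format_value` and `map_codes` that are not part of the original answer:

```python
# Hypothetical dictionary whose values are real lists rather than strings.
d_lists = {'00FQ': ['A', 'B'], '8LK9': ['X'], '4F5H': ['U', 'Z']}


def format_value(values):
    """Render a list as "['A','B']" (no space after the comma), as in the question."""
    return '[' + ','.join(repr(v) for v in values) + ']'


def map_codes(code, mapping, sep='|'):
    """Split on sep, look up each piece, and join the rendered lists with sep."""
    return sep.join(format_value(mapping[c]) for c in code.split(sep))


codes = ['00FQ', '8LK9|4F5H']
logic = [map_codes(c, d_lists) for c in codes]
print(logic)  # ["['A','B']", "['X']|['U','Z']"]
```

With a DataFrame, something like `df['Logic'] = df['Code'].map(lambda c: map_codes(c, d_lists))` should apply it row by row. Like the update, this sketch splits only on `|`.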

Update 2019-11-18 (after the OP changed the question): use replace with regex=True

df.Code.replace(d,regex=True)
Out[376]: 
0                    ['A','B']
1              ['X']|['U','Z']
2    ['H','K']|['B','G']&['X']
Name: Code, dtype: object
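
With `regex=True`, each dictionary key is treated as a regular expression pattern, so keys containing characters such as `.` or `+` would need escaping. A sketch of the same substitution done in a single pass with the standard `re` module, assuming the illustrative names `pattern` and `translate` (they do not appear in the answer):

```python
import re

d = {'00FQ': "['A','B']", '8LK9': "['X']", '4F5H': "['U','Z']",
     '2RLA': "['H','K']", 'R1T5': "['B','G']"}

# One alternation over all keys, escaped so they match literally.
# Longer keys come first so a shorter key cannot shadow a longer one.
pattern = re.compile('|'.join(re.escape(k) for k in sorted(d, key=len, reverse=True)))


def translate(code):
    """Replace every known code inside the string with its dictionary value."""
    return pattern.sub(lambda m: d[m.group()], code)


codes = ['00FQ', '8LK9|4F5H', '2RLA|R1T5&8LK9']
logic = [translate(c) for c in codes]
print(logic)  # ["['A','B']", "['X']|['U','Z']", "['H','K']|['B','G']&['X']"]
```

Because it is a plain function, it could be applied with `df['Code'].map(translate)`; separators such as `|` and `&` are left untouched because they are not dictionary keys.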

edited Nov 18 '19 at 21:15

answered Nov 14 '19 at 15:34

I get `RecursionError: maximum recursion depth exceeded while calling Python object`. Attempting to raise the limit with `sys.setrecursionlimit(10000)` causes it to crash after a while – MaxB Nov 14 '19 at 16:15
@MaxB Does your Code column contain long strings? – BENY Nov 14 '19 at 16:18
Yes, some of the list values in the dictionary may have a large number of elements – MaxB Nov 14 '19 at 16:20
Then I recommend a for loop :-) – BENY Nov 14 '19 at 16:22
I'm assuming those are two separate lines of code. When trying to execute `['|'.join(d.get(y)for y in x) for x in s]` I get the following error `TypeError: sequence item 0: expected str instance, list found` – MaxB Nov 14 '19 at 16:38
@MaxB `s=df.Code.str.split('|')` this is the new s – BENY Nov 14 '19 at 16:54
Yes I tried the exact updated code you provided and the TypeError still happens – MaxB Nov 14 '19 at 17:33
@MaxB I am not sure what you need now; it seems like you just need a replace. I have updated the solution – BENY Nov 18 '19 at 21:16
I did not downvote. Sorry for changing the question so much; I added a bounty to make up for it :). Doing replace raises the following error: `ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()` – MaxB Nov 18 '19 at 22:06
@MaxB I am following your sample, and all my solutions work fine with your sample data and yield the expected output, so kindly compare the sample data with your real data frame to see the difference – BENY Nov 18 '19 at 22:12
could this be caused by me using old pandas version? (23.4) – MaxB Nov 18 '19 at 22:15
@MaxB I do not think so. BTW, the dict should be d={'00FQ':"['A','B']", '8LK9':"['X']", '4F5H':"['U','Z']", '2RLA':"['H','K']", 'R1T5':"['B','G']" } – BENY Nov 18 '19 at 22:16
Thank you, question updated. My data is in exactly the same format (just a few more combinations). I will troubleshoot. Thank you for the help; I will award the bounty and delete the comments tomorrow. – MaxB Nov 18 '19 at 22:27

Dani Mesejo · Answer 2 · 2019-11-18T23:08:02.990

You could use str.replace.

Setup

import pandas as pd

df = pd.DataFrame(data=[['ABCD', '00FQ'], ['JKFA', '8LK9|4F5H'], ['QWST', '2RLA|R1T5&8LK9']], columns=['ID', 'Code'])

d = {'00FQ': "['A','B']", '8LK9': "['X']", '4F5H': "['U','Z']", '2RLA': "['H','K']", 'R1T5': "['B','G']"}


def r(w, d=d):
    """Function to be used for dictionary based replacement"""
    return d[w.group()]

Code

df['Logic'] = df['Code'].str.replace('[^|&]+', r, regex=True).str.replace('([|&])', r' \1 ', regex=True)
print(df)

Output

     ID            Code                          Logic
0  ABCD            00FQ                      ['A','B']
1  JKFA       8LK9|4F5H              ['X'] | ['U','Z']
2  QWST  2RLA|R1T5&8LK9  ['H','K'] | ['B','G'] & ['X']

The idea is to first replace every run of characters that is not | or & with its corresponding value in the dictionary (using the function r). Once this is done, replace every | or & (matched with a capturing group) with itself surrounded by spaces (r' \1 ').

Notice that in the first call to str.replace the repl parameter is a function (callable). This is allowed, as the pandas documentation for Series.str.replace specifies:

The callable is passed the regex match object and must return a replacement string to be used. See re.sub().

Note: This solution assumes every possible code is in the dictionary used for replacement. If that is not the case, change r to:

def r(w, d=d):
    """Function to be used for dictionary based replacement"""
    return d.get(w.group(), w.group())
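
The two steps described above can also be tried outside pandas with `re.sub`, which is what the callable form of `str.replace` is modelled on. A minimal sketch, assuming the illustrative names `lookup` and `to_logic` (not part of the answer) and using the fallback variant so that unknown codes pass through unchanged:

```python
import re

d = {'00FQ': "['A','B']", '8LK9': "['X']", '4F5H': "['U','Z']",
     '2RLA': "['H','K']", 'R1T5': "['B','G']"}


def lookup(match):
    """Return the dictionary value for the matched code, or the code itself if unknown."""
    code = match.group()
    return d.get(code, code)


def to_logic(code):
    # Step 1: replace every run of characters that is not | or & via the dictionary.
    replaced = re.sub(r'[^|&]+', lookup, code)
    # Step 2: surround every | or & with single spaces.
    return re.sub(r'([|&])', r' \1 ', replaced)


print(to_logic('2RLA|R1T5&8LK9'))  # ['H','K'] | ['B','G'] & ['X']
print(to_logic('ZZZZ|00FQ'))       # ZZZZ | ['A','B']
```

Step 2 is safe to run after step 1 here only because none of the dictionary values contain `|` or `&`; if they did, those characters would be padded with spaces as well.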

I get `AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas`, which is confusing because I have verified that the column is of type `str` — MaxB, Nov 19 '19 at 14:24
@MaxB Did the error occur with the setup provided in the answer? — Dani Mesejo, Nov 19 '19 at 14:58
It did, I tried simplifying the df down to only one element as well. — MaxB, Nov 19 '19 at 15:06
Could you try: `df['Logic'] = df['Code'].astype(str).str.replace('[^|&]+', r).astype(str).str.replace('([|&])', r' \1 ')` instead? — Dani Mesejo, Nov 19 '19 at 15:09
I tried this and it works, but I get `'nan'` in each row for `df['Logic']` — MaxB, Nov 19 '19 at 15:14
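
The thread does not establish what caused these errors. One way to investigate is to count the Python types actually stored in the column, since non-string entries (for example missing values stored as float `nan`) are one possible source of `.str` accessor problems. A small diagnostic sketch, assuming a hypothetical list `values` standing in for the column contents:

```python
from collections import Counter

# Hypothetical column contents: two strings and one missing value.
values = ['00FQ', '8LK9|4F5H', float('nan')]

# How many entries of each Python type does the column hold?
type_counts = Counter(type(v).__name__ for v in values)

# Positions and values of every entry that is not a string.
non_strings = [(i, v) for i, v in enumerate(values) if not isinstance(v, str)]

print(type_counts)
print(non_strings)
```

For a real DataFrame, `df['Code'].tolist()` would give a list that can be inspected the same way.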
