
I couldn't find a solution on Stack Overflow for replacing text based on a dictionary whose values are lists.

Dictionary

dct  = {"LOL": ["laught out loud", "laught-out loud"],
        "TLDR": ["too long didn't read", "too long; did not read"],
        "application": ["app"]}

Input

input_df = pd.DataFrame([("haha too long didn't read and laught out loud :D"),
                         ("laught-out loud so I couldnt too long; did not read"),
                         ("what happened?")], columns=['text'])

Expected output

output_df = pd.DataFrame([("haha TLDR and LOL :D"),
                          ("LOL so I couldnt TLDR"),
                          ("what happened?")], columns=['text'])

Edit

I added an additional entry to the dictionary, i.e. "application": ["app"].

The current solutions turn the third row into "what happlicationened?", because "app" is matched inside "happened".

Please suggest a fix.

GeorgeOfTheRF

4 Answers


Build an inverted mapping and use Series.replace with regex=True.

mapping = {v : k for k, V in dct.items() for v in V}
input_df['text'] = input_df['text'].replace(mapping, regex=True)

print(input_df)
                    text
0   haha TLDR and LOL :D
1  LOL so I couldnt TLDR

Where,

print(mapping)
{'laught out loud': 'LOL',
 'laught-out loud': 'LOL',
 "too long didn't read": 'TLDR',
 'too long; did not read': 'TLDR'}

To match whole words only, wrap each phrase in regex word boundaries (\b):

mapping = {rf'\b{v}\b' : k for k, V in dct.items() for v in V}
input_df['text'] = input_df['text'].replace(mapping, regex=True)

print(input_df)
                    text
0   haha TLDR and LOL :D
1  LOL so I couldnt TLDR
2         what happened?

Where,

print(mapping)
{'\\bapp\\b': 'application',
 '\\blaught out loud\\b': 'LOL',
 '\\blaught-out loud\\b': 'LOL',
 "\\btoo long didn't read\\b": 'TLDR',
 '\\btoo long; did not read\\b': 'TLDR'}
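
If the phrases might contain regex metacharacters (a dot or a parenthesis, say), each phrase can be escaped before the boundaries are added. This is a minimal sketch that assumes the question's dct and input_df. The re.escape call is an addition for illustration and is not part of the answer above.

```python
import re

import pandas as pd

dct = {"LOL": ["laught out loud", "laught-out loud"],
       "TLDR": ["too long didn't read", "too long; did not read"],
       "application": ["app"]}

input_df = pd.DataFrame(["haha too long didn't read and laught out loud :D",
                         "laught-out loud so I couldnt too long; did not read",
                         "what happened?"], columns=["text"])

# Escape each phrase so it is matched literally, then wrap it in word boundaries.
mapping = {rf"\b{re.escape(v)}\b": k for k, V in dct.items() for v in V}

result = input_df["text"].replace(mapping, regex=True)
print(result.tolist())
```

With the boundaries in place, "app" no longer matches inside "happened", so the third row is left unchanged.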
cs95
  • Brilliant! Please suggest a fix for the following issue. I added an additional entry to the dictionary, "application": ["app"]. The current solutions give the output "what happlicationened?" – GeorgeOfTheRF Oct 31 '18 at 07:55
  • 1
    @ML_Pro you mean you only want it to match whole words? Hmm, in that case try changing "app" to r"\bapp\b", and do this for every string to replace. That is a regex word boundary which would only match whole words. – cs95 Oct 31 '18 at 08:41
  • Thanks. However, I am loading the dict from a JSON file. How do I convert "app" to r"\bapp\b" using python code? I couldn't find a function to convert string to raw string. Accepted your response as the answer. – GeorgeOfTheRF Nov 09 '18 at 03:46
  • Excellent. Got it. – GeorgeOfTheRF Nov 09 '18 at 03:59
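
On the JSON question in the comments: a raw string is only a way of writing a literal in source code, so there is nothing to convert at runtime. A string loaded from JSON can simply be concatenated with r"\b". This is a small sketch using the standard re module; the json_text variable is hypothetical and stands in for the file mentioned in the comment.

```python
import json
import re

# Hypothetical JSON payload standing in for the file mentioned in the comment.
json_text = '{"application": ["app"], "LOL": ["laught out loud"]}'
dct = json.loads(json_text)

# r"\b" and "\\b" are the same two-character string: a backslash followed by "b".
assert r"\b" == "\\b"

# Build the word-boundary patterns from the loaded (ordinary) strings.
mapping = {r"\b" + re.escape(v) + r"\b": k for k, V in dct.items() for v in V}

text = "what happened to the app?"
for pattern, replacement in mapping.items():
    text = re.sub(pattern, replacement, text)
print(text)
```

Only the standalone word "app" is replaced; "happened" is left alone.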

Using df.apply and a custom function

Example:

import pandas as pd


def custReplace(value):
    dct  = {"LOL": ["laught out loud", "laught-out loud"],
        "TLDR": ["too long didn't read", "too long; did not read"]
        }

    for k, v in dct.items():
        for i in v:
            if i in value:
                value = value.replace(i, k)
    return value

input_df = pd.DataFrame([("haha too long didn't read and laught out loud :D"),
       ("laught-out loud so I couldnt too long; did not read")], columns=['text'])

print(input_df["text"].apply(custReplace))

Output:

0     haha TLDR and LOL :D
1    LOL so I couldnt TLDR
Name: text, dtype: object

Or build one alternation pattern per key and use Series.replace with regex=True:

dct  = {"LOL": ["laught out loud", "laught-out loud"],
        "TLDR": ["too long didn't read", "too long; did not read"]
        }

dct = { "(" + "|".join(v) + ")": k for k, v in dct.items()}
input_df = pd.DataFrame([("haha too long didn't read and laught out loud :D"),
       ("laught-out loud so I couldnt too long; did not read")], columns=['text'])

print(input_df["text"].replace(dct, regex=True))
Rakesh

Here is how I would do it:

import pandas as pd


dct  = {"LOL": ["laught out loud", "laught-out loud"],
        "TLDR": ["too long didn't read", "too long; did not read"]
        }

input_df = pd.DataFrame([("haha too long didn't read and laught out loud :D"),
       ("laught-out loud so I couldnt too long; did not read")], columns=['text'])

dct_inv = {}
for key, vals in dct.items():
    for val in vals:
        dct_inv[val]=key

dct_inv

def replace_text(input_str):
    for key, val in dct_inv.items():
        input_str = str(input_str).replace(key, val)
    return input_str

# Apply to the text column; with axis=1 each row would arrive as a Series, not a string.
input_df["text"].apply(replace_text).to_frame()
quest

I think the most logical place to start is to reverse your dictionary, so that each key is an original string and its value is the replacement string. You can do that by hand or in any number of other ways, for example:

import itertools
dict_rev = dict(itertools.chain.from_iterable([list(zip(v, [k]*len(v))) for k, v in dct.items()]))

That isn't super readable. This one looks better, and I stole it from another answer:

dict_rev = {v : k for k, V in dct.items() for v in V}

This requires that every value in your dictionary is a list (or other non-string iterable), e.g. "new key": ["single_val"]; otherwise the comprehension iterates over the individual characters of the string.
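
To see what goes wrong with a bare string value, compare the two inversions below. This is a short sketch; both dictionaries are made up for illustration.

```python
# Values wrapped in lists: each phrase becomes one key of the inverted dictionary.
good = {"application": ["app"]}
good_rev = {v: k for k, V in good.items() for v in V}
print(good_rev)  # {'app': 'application'}

# A bare string is itself iterable, so the inner loop walks its characters.
bad = {"application": "app"}
bad_rev = {v: k for k, V in bad.items() for v in V}
print(bad_rev)  # {'a': 'application', 'p': 'application'}
```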

You can then do the following (based on the code in "How to replace multiple substrings of a string?"):

import re
rep = dict((re.escape(k), v) for k, v in dict_rev.items())
pattern = re.compile("|".join(rep.keys()))
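# Pass regex=True explicitly: newer pandas versions default to regex=False, which rejects a compiled pattern.
input_df["text"] = input_df["text"].str.replace(pattern, lambda m: rep[re.escape(m.group(0))], regex=True)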

In the timings below, this method runs roughly 2.7 times faster than the simpler, more elegant solution (160 µs versus 425 µs per loop):

Simple:

%timeit input_df["text"].replace(dict_rev, regex=True)

425 µs ± 38.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Faster:

%timeit input_df["text"].str.replace(pattern, lambda m: rep[re.escape(m.group(0))])

160 µs ± 7.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
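
The single-pattern approach can also be combined with the whole-word requirement from the question's edit. This sketch uses only the standard re module and assumes the question's dct. Sorting the alternatives by length is my own precaution, so that a longer phrase is tried before a shorter one that shares its start. The same compiled pattern and callable should also work with Series.str.replace(pattern, repl, regex=True).

```python
import re

dct = {"LOL": ["laught out loud", "laught-out loud"],
       "TLDR": ["too long didn't read", "too long; did not read"],
       "application": ["app"]}

# Invert the dictionary: original phrase -> replacement.
dict_rev = {v: k for k, V in dct.items() for v in V}

# Longest phrases first, then one alternation wrapped in word boundaries.
alternatives = sorted(dict_rev, key=len, reverse=True)
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, alternatives)) + r")\b")


def repl(match):
    # The matched text is exactly one of the original phrases, so look it up directly.
    return dict_rev[match.group(0)]


texts = ["haha too long didn't read and laught out loud :D",
         "laught-out loud so I couldnt too long; did not read",
         "what happened?"]
result = [pattern.sub(repl, t) for t in texts]
print(result)
```

Looking up match.group(0) in the unescaped dictionary avoids escaping the matched text a second time.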
Sven Harris