1

here is one good explained topic on stackoverflow: Replacing few values in a pandas dataframe column with another value

The example is:

BrandName Specialty
A          H
B          I
ABC        J
D          K
AB         L

and the solution is:

df['BrandName'] = df['BrandName'].replace(['ABC', 'AB'], 'A')

The problem is my dataframe is a little bit different, I have two strings in a row:

BrandName Specialty
A          H
B          I
ABC B      J
D          K
AB         L

The desired output still is:

BrandName Specialty
A          H
B          I
A B        J
D          K
A          L

How can I achieve this?

PV8
  • 5,799
  • 7
  • 43
  • 87

1 Answers1

1

Use regex=True for subtring replacement:

df['BrandName'] = df['BrandName'].replace(['ABC', 'AB'], 'A', regex=True)
print (df)
  BrandName Specialty
0         A         H
1         B         I
2       A B         J
3         D         K
4         A         L

Another solution is necessary, if need to avoid replacement values in anaother substrings, like ABCD is not replaced, then need regex words boundaries:

print (df)
  BrandName Specialty
0    A ABCD         H
1         B         I
2     ABC B         J
3         D         K
4        AB         L


L = [r"\b{}\b".format(x) for x in ['ABC', 'AB']]

df['BrandName1'] = df['BrandName'].replace(L, 'A', regex=True)
df['BrandName2'] = df['BrandName'].replace(['ABC', 'AB'], 'A', regex=True)
print (df)
  BrandName Specialty BrandName1 BrandName2
0    A ABCD         H     A ABCD       A AD
1         B         I          B          B
2     ABC B         J        A B        A B
3         D         K          D          D
4        AB         L          A          A

Edit(from the questioner):

To speed it up, you can have a look here: Speed up millions of regex replacements in Python 3

The best one is the trieapproach:

def trie_regex_from_words(words):
    trie = Trie()
    for word in words:
        trie.add(word)
    return re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE)

union = trie_regex_from_words(strings)
df['BrandName'] = df['BrandName'].replace(union, 'A', regex=True)
PV8
  • 5,799
  • 7
  • 43
  • 87
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • if i run it for `df['BrandName1']`I receive the error: `incomplete escape \x at position 3` – PV8 Oct 09 '19 at 13:03
  • 1
    @PV8 - then change `L = [r"\b{}\b".format(x) for x in ['ABC', 'AB']]` to `import re` `L = [r"\b{}\b".format(re.escape(x)) for x in ['ABC', 'AB']]` - it seems some value with necessary escape for regex – jezrael Oct 09 '19 at 13:04
  • any idea how to speed it up? – PV8 Oct 29 '19 at 10:15
  • @PV8 - really not easy, maybe pure python should be faster. – jezrael Oct 29 '19 at 11:19
  • it alreadys fits to this answer: https://stackoverflow.com/questions/42742810/speed-up-millions-of-regex-replacements-in-python-3 right? – PV8 Oct 29 '19 at 12:19
  • not really, in this case it is: `L=['\bWordA\b', '\BWordB\b']`how can I change it to: `L=\b(WordA|WordB)\b` – PV8 Oct 29 '19 at 12:21
  • 1
    @PV8 - sorry, a bit busy. Do you think change `L = ['WordA', 'WordB']` to `\b+ '|'.join(L) + \b` ? Unfortunately regexes are still hard for me, especially if some more complicated or some special features with them. – jezrael Oct 29 '19 at 12:59
  • 1
    no problem, last comment to that, there are someuseful hints, as I already mentioned in my comment above :) – PV8 Oct 29 '19 at 13:01