How to replace a certain word in a dataframe only if it's preceded by a number?

Question

I'm trying to search a dataframe for certain words listed in the values of a dictionary; if any of them is found, it should be replaced with the corresponding key.

units_dic= {'grams':['g','Grams'],
                'kg'   :['kilogram','kilograms']}

The problem is that some unit abbreviations are single letters, so every occurrence of those letters gets replaced as well. I want to do the replacement only if the abbreviation is preceded by a number, to make sure it's a unit.

Dataframe

    Id | test 
    ---------
    1  |'A small paperclip has a mass of about 111 g'
    2  |'1 kilogram =1000 g'
    3  |'g is the 7th letter in the ISO basic Latin alphabet'

Replacement Loop

  x = df.copy()
  for k in units_dic:
      for i in range(len(x['test'])):
          for w in units_dic[k]:
              x['test'][i] = str(x['test'][i]).replace(str(w), str(k))

The Output

    Id | test 
    ---------
    1  |'A small paperclip has a mass of about 111 grams'
    2  |'1 kg =1000 grams'
    3  |'grams is the 7th letter in the ISO basic Latin alphabet'

FYI https://stackoverflow.com/questions/6116978/how-to-replace-multiple-substrings-of-a-string — BENY, May 12 '19 at 19:26

score 1 · Answer 1 · answered May 12 '19 at 19:19

Try:

for key, val in units_dic.items(): 
    df['test'] = df['test'].replace(r"\d+[ ]*" + "|".join(val), key, regex=True)

hacker315


If there is no space between the number and abbreviations it will remove the number – R_Developer May 12 '19 at 20:15
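
The behavior raised in the comment above comes from the pattern replacing the whole match, digits included, and from the alternation not being grouped. A minimal sketch of one possible fix, using plain `re` on single strings: capture the number and put it back in the replacement. The helper name `replace_units` is hypothetical; `units_dic` is copied from the question.

```python
import re

units_dic = {'grams': ['g', 'Grams'], 'kg': ['kilogram', 'kilograms']}

def replace_units(text):
    """Replace unit aliases that follow a number, keeping the number."""
    for key, aliases in units_dic.items():
        # Group the alternation so the digit prefix applies to every alias,
        # and capture the digits plus optional spaces so they can be reinserted.
        alternatives = "|".join(re.escape(a) for a in aliases)
        pattern = r"(\d+[ ]*)(?:" + alternatives + r")\b"
        text = re.sub(pattern, lambda m: m.group(1) + key, text)
    return text

print(replace_units('1 kilogram =1000 g'))
print(replace_units('111g'))
```

On a DataFrame, the same function could be applied with `df['test'] = df['test'].map(replace_units)`.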

score 1 · Accepted Answer · answered May 12 '19 at 19:27

Regular expressions to the rescue along with flipping the dictionary.

import re

d = {i: k for k, v in units_dic.items() for i in v}
u = r'|'.join(d)
v = fr'(\d+\s?)\b({u})\b'

df.assign(test=[re.sub(v, lambda x: x.group(1) + d[x.group(2)], el) for el in df.test])

   Id                                               test
0   1    A small paperclip has a mass of about 111 grams
1   2                                   1 kg =1000 grams
2   3  g is the 7th letter in the ISO basic Latin alp...

user3483203


Nothing has changed – R_Developer May 15 '19 at 23:45
`df.assign` isn't in-place, you need to store the result – user3483203 May 16 '19 at 00:13
@R_Developer `df = df.assign...` – user3483203 May 19 '19 at 02:07
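
To make the point from these comments concrete, here is a self-contained sketch of the single-pass approach that rebinds the result instead of expecting an in-place change. It works on a plain list of strings with `re`; the names `alias_to_unit`, `unit_pattern`, and `normalize_units` are hypothetical, and sorting and escaping the aliases is an extra precaution not present in the answer.

```python
import re

units_dic = {'grams': ['g', 'Grams'], 'kg': ['kilogram', 'kilograms']}

# Flip the dictionary: alias -> canonical unit.
alias_to_unit = {alias: unit
                 for unit, aliases in units_dic.items()
                 for alias in aliases}

# Longest aliases first, escaped so they are matched literally.
alternatives = '|'.join(
    re.escape(a) for a in sorted(alias_to_unit, key=len, reverse=True))
unit_pattern = re.compile(fr'(\d+\s?)\b({alternatives})\b')

def normalize_units(text):
    """Replace every alias that follows a number, in one pass."""
    return unit_pattern.sub(
        lambda m: m.group(1) + alias_to_unit[m.group(2)], text)

texts = ['A small paperclip has a mass of about 111 g',
         '1 kilogram =1000 g',
         'g is the 7th letter in the ISO basic Latin alphabet']
# Rebind the name: the original strings are never modified in place.
texts = [normalize_units(t) for t in texts]
print(texts)
```

With a DataFrame, the equivalent would be `df['test'] = df['test'].map(normalize_units)`. Note that because of the `\b` right after the number group, a string such as `111g` (no space) is left unchanged, since there is no word boundary between a digit and a letter; dropping that first `\b` would let it match.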

Erfan · Answer 3 · 2019-05-12T20:29:56.407

0

We can use a regex lookbehind here, which lets us require that the match be preceded by a number, with optional whitespace in between:

for k, v in units_dic.items():
    # fr"..." keeps \s and \b as regex escapes; regex=True is needed on pandas >= 2.0
    df['test'] = df['test'].str.replace(fr"(?<=[0-9])\s*({'|'.join(v)})\b", f' {k}', regex=True)

print(df)
   Id                                               test
0   1  'A small paperclip has a mass of about 111 grams'
1   2                                 '1 kg =1000 grams'
2   3  'g is the 7th letter in the ISO basic Latin al...

Explanation
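First, we use a raw f-string (fr'sometext') so that the backslashes in \s and \b reach the regex engine unchanged.

Regular expression:

(?<=[0-9]) = preceded by a digit (lookbehind; the digit itself is not consumed)
\s* = zero or more whitespace characters
"|".join(v) gives us the values in your dictionary back, delimited by |, which is the "or" operator in regex
\b = word boundary, so the abbreviation must end there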

edited May 12 '19 at 20:29

answered May 12 '19 at 19:39
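
A small sketch of how the pieces of this pattern behave on individual strings, using plain `re` rather than pandas. The sample strings other than `111 g` are made up for illustration.

```python
import re

aliases = ['g', 'Grams']
# Lookbehind for a digit, optional whitespace, one alias, then a word boundary.
pattern = re.compile(fr"(?<=[0-9])\s*({'|'.join(aliases)})\b")

for s in ['111 g', '111g', '45 grapes', 'g is a letter']:
    # The whitespace is part of the match, so the replacement re-adds one space.
    print(repr(s), '->', repr(pattern.sub(' grams', s)))
```

The word boundary keeps `45 grapes` untouched, while `111g` gains a space because the replacement string always starts with one.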


Be careful here, the string `45 grapes` would become `45 gramsrapes`. Also, for large dictionaries, you are making potentially a *lot* of replacements when you can do it all in one pass, all of which are fairly slow using `str.replace`. Finally, if no space is found between the number and the abbreviation, this will add a space – user3483203 May 12 '19 at 20:17
What if I want the opposite (abbreviation, then digit)? I tried this: `str.replace(f"{v}\s(?<=[0-9])\b", f' {k}')` – R_Developer May 21 '19 at 03:45
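
For the reverse case asked about in the last comment (abbreviation first, then a number), one option is a lookahead `(?=...)` instead of a lookbehind. This is only a sketch under that reading of the comment; the function name `replace_units_before_number` and the sample strings are hypothetical.

```python
import re

units_dic = {'grams': ['g', 'Grams'], 'kg': ['kilogram', 'kilograms']}

def replace_units_before_number(text):
    """Replace an alias only when it is followed by optional spaces and a digit."""
    for unit, aliases in units_dic.items():
        alternatives = '|'.join(re.escape(a) for a in aliases)
        # \b: the alias must start a word; (?=\s*[0-9]): a digit must follow,
        # but the lookahead does not consume it.
        pattern = fr"\b(?:{alternatives})(?=\s*[0-9])"
        text = re.sub(pattern, lambda m: unit, text)
    return text

print(replace_units_before_number('weight: g 111'))
print(replace_units_before_number('kilograms 3 and bag 5'))
```

Because the lookahead leaves the spaces and digits in place, nothing has to be reinserted in the replacement.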
