In my dataset, I have a feature (called Size) like this one:

import pandas as pd


dit={"Size" : ["0","0","5mm","12-15","3-10"] }
dt = pd.DataFrame(data=dit)

This feature specifies a size either as a range (with a minimum and a maximum) or as a single number.

Now, I wish to replace each value with the maximum of its range.

So, in my example the output should be

dit={"Size" : ["0","0","5mm","15mm","10mm"] }

This is what I have tried:

import re
dt = re.split("-",dt.loc[:,"Size"])

But it fails with:

TypeError: expected string or bytes-like object

This makes sense, because the type of dt.loc[:,"Size"] is:

pandas.core.series.Series

My question is: how can I simply update this column without a loop?

Jeff

4 Answers

You may use

import pandas as pd
import re

dit={"Size" : ["0","0","5mm","12-15","3-10"] }
dt = pd.DataFrame(data=dit)
rx = r'(\d+)(?:mm)?-(\d+)(?:mm)?'
dt['Size']=dt['Size'].apply(lambda x: re.sub(rx, lambda z: str(max(int(z.group(1)), int(z.group(2)))) + "mm", x))

Output:

>>> dt
   Size
0     0
1     0
2   5mm
3  15mm
4  10mm

The regex matches

  • (\d+) - Group 1: one or more digits
  • (?:mm)? - an optional mm char sequence
  • - - a hyphen
  • (\d+) - Group 2: one or more digits
  • (?:mm)? - an optional mm char sequence.

The str(max(int(z.group(1)), int(z.group(2)))) + "mm" expression keeps the larger of the two captured numbers and appends mm to it.
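The same substitution can be easier to read with the callback pulled out into named functions. This is only a sketch of the approach described above; the names `RANGE_RX`, `keep_max` and `max_of_range` are made up for illustration.

```python
import re

import pandas as pd

# Same pattern as in the answer: two numbers, each optionally followed by "mm".
RANGE_RX = re.compile(r'(\d+)(?:mm)?-(\d+)(?:mm)?')


def keep_max(match):
    """Return the larger of the two captured numbers with an "mm" suffix."""
    low, high = int(match.group(1)), int(match.group(2))
    return f"{max(low, high)}mm"


def max_of_range(value):
    """Replace a range such as "12-15" with its maximum; leave other strings unchanged."""
    return RANGE_RX.sub(keep_max, value)


dt = pd.DataFrame({"Size": ["0", "0", "5mm", "12-15", "3-10"]})
dt["Size"] = dt["Size"].apply(max_of_range)
print(dt["Size"].tolist())
```

Because the callback compares the two numbers, a range written the other way round (for example 15-12) should still come out as 15mm.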

Wiktor Stribiżew
  • If I have `NaN` instead of zeros, what can I do? – Jeff Jul 21 '19 at 06:56
  • @Jeff Add `.fillna(0)`. See [How can I replace all the NaN values with Zero's in a column of a pandas dataframe](https://stackoverflow.com/questions/13295735). – Wiktor Stribiżew Jul 21 '19 at 06:58
  • Typically, you'd use `str.replace` for string/regex replacement: `dt['Size'].str.replace(rx, lambda z: z.group(1) if int(z.group(1)) > int(z.group(2)) else z.group(2))`. – cs95 Jul 21 '19 at 07:24
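
The two suggestions in the comments can be combined roughly as follows. This is a sketch under some assumptions: missing values are filled with the string "0" rather than the integer 0 so that every element stays a string, `regex=True` is passed explicitly because newer pandas versions do not treat the pattern as a regex by default, and the `keep_max` helper name is hypothetical.

```python
import numpy as np
import pandas as pd

rx = r'(\d+)(?:mm)?-(\d+)(?:mm)?'


def keep_max(match):
    """Return the larger of the two captured numbers with an "mm" suffix."""
    return str(max(int(match.group(1)), int(match.group(2)))) + "mm"


# Assumed sample data: the first value is missing instead of "0".
sizes = pd.Series([np.nan, "0", "5mm", "12-15", "3-10"])

# Fill missing values with a string, then substitute through the .str accessor.
cleaned = sizes.fillna("0").str.replace(rx, keep_max, regex=True)
print(cleaned.tolist())
```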

This expression might return the desired output, assuming the mm suffix is not wanted:

import pandas as pd

dit={"Size" : ["0","0","5mm","12-15mm","3mm-10mm"] }
dt = pd.DataFrame(data=dit)

dt['max_size']=dt["Size"].str.replace(r"^.*?(\d+)\D*$",r"\1",regex=True)

print(dt)

Output

       Size max_size
0         0        0
1         0        0
2       5mm        5
3   12-15mm       15
4  3mm-10mm       10
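
Note that the pattern above keeps the last number in each string, which equals the maximum only when ranges are written low to high. If that cannot be assumed, one possible alternative (a different technique from the one in this answer) is to extract every number with `str.extractall` and take the per-row maximum. The sample data below, including the reversed range 15-12mm, is made up for illustration.

```python
import pandas as pd

# Hypothetical data with one range written high-to-low.
dt = pd.DataFrame({"Size": ["0", "0", "5mm", "15-12mm", "3mm-10mm"]})

# One row per number found, indexed by (original row, match number).
numbers = dt["Size"].str.extractall(r"(\d+)")[0].astype(int)

# Largest number for each original row, aligned back on the row index.
dt["max_size"] = numbers.groupby(level=0).max()

print(dt)
```

Rows that contain no digits at all would end up as NaN in `max_size`, so this sketch assumes every value holds at least one number.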
Emma

We can use this simple regex to remove everything up to and including the hyphen:

import pandas as pd
dit={"Size" : ["0","0","5mm","12-15mm","3mm-10mm"] }
dt = pd.DataFrame(data=dit)
dt["max_size"] = dt["Size"].str.replace(r".*-", "", regex=True)
dt
       Size max_size
0         0        0
1         0        0
2       5mm      5mm
3   12-15mm     15mm
4  3mm-10mm     10mm

If you don't need the mm at the end, then:

dt["max_size"] = dt["Size"].str.replace(r".*-|mm$", "", regex=True)
dt
       Size max_size
0         0        0
1         0        0
2       5mm        5
3   12-15mm       15
4  3mm-10mm       10
NelsonGon

With a simple boolean mask and the Series.apply function (df here is assumed to hold ranges that end in mm, such as 12-15mm):

In [172]: m = df['Size'].str.contains(r'-?\d+mm$')

In [173]: df.loc[m, 'Size'] = df.loc[m, 'Size'].apply(lambda x: x.split('-')[-1])

In [174]: df
Out[174]: 
   Size
0     0
1     0
2   5mm
3  15mm
4  10mm

'-?\d+mm$' pattern details:

  • -? - matches optional - char
  • \d+mm$ - matches one or more digit \d followed by mm suffix at the end of the string $
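
For reference, here is a self-contained sketch of the mask approach. The session above does not show how `df` was built, so the sample data here is an assumption (ranges carrying an mm suffix), and `.str.split('-').str[-1]` is used in place of `apply` as a small variation.

```python
import pandas as pd

# Assumed input: the ranges end in "mm", which the mask pattern requires.
df = pd.DataFrame({"Size": ["0", "0", "5mm", "12-15mm", "3mm-10mm"]})

# True for rows that end in digits followed by "mm".
m = df["Size"].str.contains(r"-?\d+mm$")

# For the masked rows, keep only the part after the last hyphen.
df.loc[m, "Size"] = df.loc[m, "Size"].str.split("-").str[-1]

print(df)
```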
RomanPerekhrest