Python: Replace all column by output of reg

Question

In my dataset, I have a feature (called Size) like this one:

import pandas as pd


dit={"Size" : ["0","0","5mm","12-15","3-10"] }
dt = pd.DataFrame(data=dit)

This feature specifies a size in a range (with minimum and maximum) or by a specific number.

Now, I wish to replace each value with the maximum of its range.

So, in my example the output should be

dit={"Size" : ["0","0","5mm","15mm","10mm"] }

This is what I have tried:

import re
dt = re.split("-",dt.loc[:,"Size"])

But, it complains with:

TypeError: expected string or bytes-like object

This makes sense, because the type of dt.loc[:,"Size"] is:

pandas.core.series.Series

My question is: how can I update this column without a loop?

Could you add a sample of `dt`? – NelsonGon Jul 21 '19 at 06:44
@NelsonGon I updated my question; the sample was missing. Sorry. – Jeff Jul 21 '19 at 06:45

Wiktor Stribiżew · Answer 1 · 2019-07-21T07:50:20.653

3

You may use

import pandas as pd
import re

dit={"Size" : ["0","0","5mm","12-15","3-10"] }
dt = pd.DataFrame(data=dit)
rx = r'(\d+)(?:mm)?-(\d+)(?:mm)?'
dt['Size']=dt['Size'].apply(lambda x: re.sub(rx, lambda z: str(max(int(z.group(1)), int(z.group(2)))) + "mm", x))

Output:

>>> dt
   Size
0     0
1     0
2   5mm
3  15mm
4  10mm

The regex matches

(\d+) - Group 1: one or more digits
(?:mm)? - an optional mm character sequence
- - a hyphen
(\d+) - Group 2: one or more digits
(?:mm)? - an optional mm character sequence

The str(max(int(z.group(1)), int(z.group(2)))) + "mm" expression keeps the larger of the two captured numbers and appends mm to it.
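To see what the replacement callback does on individual strings, here is a minimal sketch that uses only the `re` module. The helper name `upper_bound` and the compiled `RANGE_RX` are illustrative names, not part of the answer above.

```python
import re

# Same pattern as in the answer: two numbers separated by a hyphen,
# each optionally followed by "mm".
RANGE_RX = re.compile(r"(\d+)(?:mm)?-(\d+)(?:mm)?")


def upper_bound(size):
    """Replace a 'low-high' range with its larger number plus 'mm'."""
    def pick_max(match):
        # Compare the captured groups as integers, not as strings.
        return str(max(int(match.group(1)), int(match.group(2)))) + "mm"

    # Strings without a hyphenated range do not match and are returned as-is.
    return RANGE_RX.sub(pick_max, size)


print(upper_bound("12-15"))     # 15mm
print(upper_bound("3mm-10mm"))  # 10mm
print(upper_bound("5mm"))       # 5mm
print(upper_bound("0"))         # 0
```

Because the comparison is done on integers, a reversed range such as "15-12" also yields 15mm.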

edited Jul 21 '19 at 07:50

answered Jul 21 '19 at 06:54


If I have `NaN` instead of zeros, what can I do? – Jeff Jul 21 '19 at 06:56
@Jeff Add `.fillna(0)`. See [How can I replace all the NaN values with Zero's in a column of a pandas dataframe](https://stackoverflow.com/questions/13295735). – Wiktor Stribiżew Jul 21 '19 at 06:58
Typically, you'd use `str.replace` for string/regex replacement: `dt['Size'].str.replace(rx, lambda z: z.group(1) if int(z.group(1)) > int(z.group(2)) else z.group(2))`. – cs95 Jul 21 '19 at 07:24
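The two suggestions in the comments above can be combined. The following is a sketch under some assumptions: missing values are filled with the string "0" rather than the integer 0 so that every element stays a string, `regex=True` is passed explicitly because recent pandas versions require it for a callable replacement, and the names `sizes` and `keep_larger` are made up for the example.

```python
import numpy as np
import pandas as pd

rx = r"(\d+)(?:mm)?-(\d+)(?:mm)?"

# Hypothetical sample with a missing value in place of the first zero.
sizes = pd.Series([np.nan, "0", "5mm", "12-15", "3-10"])


def keep_larger(match):
    # Return whichever captured group is numerically larger, as a string.
    return max(match.group(1), match.group(2), key=int)


# Fill NaN first, then let str.replace call keep_larger on each range.
result = sizes.fillna("0").str.replace(rx, keep_larger, regex=True)
print(result.tolist())  # ['0', '0', '5mm', '15', '10']
```

Unlike the answer's version, this variant does not append mm to the replaced ranges, which matches the form used in the last comment.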

Emma · Accepted Answer · 2019-07-21T06:56:51.787

1

This expression might return the desired output, assuming that mm is not wanted in the result:

import pandas as pd

dit={"Size" : ["0","0","5mm","12-15mm","3mm-10mm"] }
dt = pd.DataFrame(data=dit)

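# regex=True is required in recent pandas versions, where str.replace is literal by default
dt['max_size']=dt["Size"].str.replace(r"^.*?(\d+)\D*$",r"\1", regex=True)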

print(dt)

Output

       Size max_size
0         0        0
1         0        0
2       5mm        5
3   12-15mm       15
4  3mm-10mm       10
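Note that the pattern above keeps the last number in each string, which equals the maximum only when ranges are written as low-high. If the order cannot be relied on, one possible approach is to extract every number and take the numeric maximum per row. This is a sketch, not part of the answer; the extra "15-12" row is added only to show the difference.

```python
import pandas as pd

dt = pd.DataFrame({"Size": ["0", "0", "5mm", "12-15mm", "3mm-10mm", "15-12"]})

# extractall returns one row per match, indexed by (original row, match number).
numbers = dt["Size"].str.extractall(r"(\d+)")[0].astype(int)

# Group by the original row label and keep the largest number found in each row.
dt["max_size"] = numbers.groupby(level=0).max()

print(dt["max_size"].tolist())  # [0, 0, 5, 15, 10, 15]
```

Rows that contain no digits at all would end up as NaN in max_size, since they produce no matches.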

edited Jul 21 '19 at 06:56

answered Jul 21 '19 at 06:54


If I have `NaN` instead of zeros, what can I do? – Jeff Jul 21 '19 at 06:56

NelsonGon · Answer 3 · 2019-07-21T07:25:29.597

1

We can use this simple regex to remove everything up to and including the hyphen:

import pandas as pd

dit={"Size" : ["0","0","5mm","12-15mm","3mm-10mm"] }
dt = pd.DataFrame(data=dit)
# raw string avoids the invalid "\-" escape; regex=True is required in recent pandas
dt["max_size"] = dt["Size"].str.replace(r".*-", "", regex=True)
dt
       Size max_size
0         0        0
1         0        0
2       5mm      5mm
3   12-15mm     15mm
4  3mm-10mm     10mm

If you don't need the mm at the end, then:

dt["max_size"] = dt["Size"].str.replace(r".*-|mm$", "", regex=True)
dt
       Size max_size
0         0        0
1         0        0
2       5mm        5
3   12-15mm       15
4  3mm-10mm       10

edited Jul 21 '19 at 07:25

answered Jul 21 '19 at 06:59


Thanks, but I do not need `mm` in the output – Jeff Jul 21 '19 at 07:14
Ok, I'll edit. Just followed this: `So, in my example the output should be dit={"Size" : ["0","0","5mm","15mm","10mm"] }` – NelsonGon Jul 21 '19 at 07:17

score 1 · Answer 4 · answered Jul 21 '19 at 07:09

With simple mask and Series.apply function:

In [172]: m = df['Size'].str.contains(r'-?\d+mm$')

In [173]: df.loc[m, 'Size'] = df.loc[m, 'Size'].apply(lambda x: x.split('-')[-1])

In [174]: df
Out[174]:
   Size
0     0
1     0
2   5mm
3  15mm
4  10mm

'-?\d+mm$' pattern details:

-? - matches an optional - character
\d+mm$ - matches one or more digits (\d+) followed by the mm suffix at the end of the string ($)
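The same mask can be paired with the vectorized string accessor instead of `Series.apply`. This is a sketch that assumes the ranges carry an mm suffix (as in "12-15mm"), since the mask only selects strings ending in digits followed by mm.

```python
import pandas as pd

# Assumed sample data: ranges include the "mm" suffix so the mask selects them.
df = pd.DataFrame({"Size": ["0", "0", "5mm", "12-15mm", "3mm-10mm"]})

# Select only the rows that end in "<digits>mm".
m = df["Size"].str.contains(r"-?\d+mm$")

# Split on the hyphen and keep the last piece, without a Python-level lambda.
df.loc[m, "Size"] = df.loc[m, "Size"].str.split("-").str[-1]

print(df["Size"].tolist())  # ['0', '0', '5mm', '15mm', '10mm']
```

As with the original, this keeps the part after the last hyphen, so it relies on ranges being written in low-high order.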

@Jeff, then why did you write *my example the output should be* `["0","0","5mm","15mm","10mm"]` in your question??? — RomanPerekhrest, Jul 21 '19 at 07:15
