applying regex to a pandas dataframe

Question

I'm having trouble applying a regex function a column in a python dataframe. Here is the head of my dataframe:

               Name   Season          School   G    MP  FGA  3P  3PA    3P%
 74       Joe Dumars  1982-83   McNeese State  29   NaN  487   5    8  0.625   
 84      Sam Vincent  1982-83  Michigan State  30  1066  401   5   11  0.455   
 176  Gerald Wilkins  1982-83     Chattanooga  30   820  350   0    2  0.000   
 177  Gerald Wilkins  1983-84     Chattanooga  23   737  297   3   10  0.300   
 243    Delaney Rudd  1982-83     Wake Forest  32  1004  324  13   29  0.448

I thought I had a pretty good grasp of applying functions to Dataframes, so maybe my Regex skills are lacking.

Here is what I put together:

import re

def split_it(year):
    return re.findall('(\d\d\d\d)', year)

 df['Season2'] = df['Season'].apply(split_it(x))

TypeError: expected string or buffer

Output would be a column called Season2 that contains the year before the hyphen. I'm sure theres an easier way to do it without regex, but more importantly, i'm trying to figure out what I did wrong

Thanks for any help in advance.

score 63 · Accepted Answer · answered Aug 13 '14 at 18:01

When I try (a variant of) your code I get NameError: name 'x' is not defined-- which it isn't.

You could use either

df['Season2'] = df['Season'].apply(split_it)

or

df['Season2'] = df['Season'].apply(lambda x: split_it(x))

but the second one is just a longer and slower way to write the first one, so there's not much point (unless you have other arguments to handle, which we don't here.) Your function will return a list, though:

>>> df["Season"].apply(split_it)
74     [1982]
84     [1982]
176    [1982]
177    [1983]
243    [1982]
Name: Season, dtype: object

although you could easily change that. FWIW, I'd use vectorized string operations and do something like

>>> df["Season"].str[:4].astype(int)
74     1982
84     1982
176    1982
177    1983
243    1982
Name: Season, dtype: int64

or

>>> df["Season"].str.split("-").str[0].astype(int)
74     1982
84     1982
176    1982
177    1983
243    1982
Name: Season, dtype: int64

realized i asked the question wrong and had what you gave me. my error was coming b/c i had NaN values in the year further down the dataframe. I found that out by trying df["Season"].str.split("-").str[0].astype(int). Thanks anyways though, really appreciate it — itjcms18, Aug 13 '14 at 19:15
when I want to lowercase the values, ```df['Season2'] = df['Season'].apply(lower())``` doesn't work, I have to use lambda — CN_Cabbage, May 10 '22 at 06:08

score 33 · Answer 2 · answered Jan 27 '20 at 10:42

33

You can simply use str.extract

df['Season2']=df['Season'].str.extract(r'(\d{4})-\d{2}')

Here you locate \d{4}-\d{2} (for example 1982-83) but only extracts the captured group between parenthesis \d{4} (for example 1982)

answered Jan 27 '20 at 10:42

Gabriel

857
10
14

1

Is case there are more than one matches, is there a way to separate the matches with a ```,``` ? – Marrluxia Jul 22 '22 at 08:37

score 11 · Answer 3 · edited Mar 18 '18 at 22:21

11

The asked problem can be solved by writing the following code :

import re
def split_it(year):
    x = re.findall('([\d]{4})', year)
    if x :
      return(x.group())

df['Season2'] = df['Season'].apply(split_it)

You were facing this problem as some rows didn't had year in the string

edited Mar 18 '18 at 22:21

Peter Barrett Bryan

645
4
13

answered Apr 26 '16 at 13:01

Pratik409

301
2
10

score 3 · Answer 4 · answered Aug 30 '20 at 05:20

3

you can use pandas native function to do it too.

check this page for the pandas functions that accepts regular expression. for your case, you can do

df["Season"].str.extract(r'([\d]{4}))')

answered Aug 30 '20 at 05:20

Julian Eccleshall

444
5
11

score 0 · Answer 5 · edited Mar 11 '19 at 03:59

0

I had the exact same issue. Thanks for the answers @DSM. FYI @itjcms, you can improve the function by removing the repetition of the '\d\d\d\d'.

def split_it(year):  
    return re.findall('(\d\d\d\d)', year)

Becomes:

def split_it(year):
    return re.findall('(\d{4})', year)

edited Mar 11 '19 at 03:59

MarredCheese

17,541
8
92
91

answered Mar 11 '19 at 02:08

Tony Alleven

31
2

score -1 · Answer 6 · answered Oct 25 '21 at 12:36

-1

I'd extract with:

df['Season2']=df['Season'].str.extract(r'(\d{4}))

answered Oct 25 '21 at 12:36

Simon

117
2
10

applying regex to a pandas dataframe

6 Answers6

Linked