
I am trying to create a new column in my df using numerical data from another column. I attempted using a for loop and a series of if statements to categorize the numerical data into strings that I want to now use to create the new column. The following data is from the WNBA 2010-2011 dataset about the players.

def clean(col):  
    for xp in col:
        if xp < 1:
            print('Rookie')
        elif ((xp >= 1) and (xp <= 3)):
            print('Little experience')
        elif ((xp >= 4) and (xp <= 5)):
            print('Experienced')
        elif ((xp > 5) and (xp < 10)):
            print('Very experienced')
        elif (xp > 10):
            print("Veteran")

I tried both Series.apply() and Series.map(), assigning the result to a new column called XP as follows:

XP = df.Experience.apply(clean) 
df['XP'] = XP

However, when I checked the dtypes, the newly created column turned out to hold NoneType objects. Is this because I am using the print function in the for loop as opposed to returning the actual value? If so, what should I do to return the string values specified?

Thanks in advance for the help.

  • The best answer to your title is that you shouldn't. `.apply` is a slow loop; in pandas you would use `np.select` instead: https://stackoverflow.com/questions/19913659/pandas-conditional-creation-of-a-series-dataframe-column – ALollz Jun 26 '20 at 15:37

2 Answers


That's because your function doesn't return anything (so returns None by default). You need to replace those print statements with return.
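To see why the column ends up holding None, here is a minimal sketch. The function names and the two-value Series are made up for illustration: a function that only prints hands nothing back to the caller, so apply has nothing but None to collect.

```python
import pandas as pd

def label_with_print(xp):
    # Prints the label but has no return statement, so the call evaluates to None.
    print('Rookie' if xp < 1 else 'Not a rookie')

def label_with_return(xp):
    # Hands the label back to the caller, so apply can collect it.
    return 'Rookie' if xp < 1 else 'Not a rookie'

experience = pd.Series([0, 3])
printed = experience.apply(label_with_print)    # every element is missing (None)
returned = experience.apply(label_with_return)  # every element is a string
```

The printed text shows up on the screen, but only the returned values make it into the resulting Series.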

Also, you don't need to loop over the column in your function: apply already calls the function once for each element of the column (it is a loop under the hood, not a truly vectorized operation). Try this:

def clean(xp):  
    if xp < 1:
        return 'Rookie'
    elif ((xp >= 1) and (xp <= 3)):
        return 'Little experience'
    elif ((xp >= 4) and (xp <= 5)):
        return 'Experienced'
    elif ((xp > 5) and (xp < 10)):
        return 'Very experienced'
    elif (xp > 10):
        return "Veteran"

df['XP'] = df.Experience.apply(clean)

Bear in mind also that, the way your inequalities are currently written, your function will return None if xp == 10, because neither xp < 10 nor xp > 10 matches.
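For comparison, here is a rough sketch of the `np.select` approach suggested in the comment on the question. The sample data is invented, and it assumes that 10 or more years should count as "Veteran", which is a guess at the intent rather than something stated in the question. Because the conditions are checked in order and the first match wins, each one only needs an upper bound, which also removes the gaps between ranges (so a fractional value such as 3.5 would land in "Experienced").

```python
import numpy as np
import pandas as pd

# Invented sample data; the column name follows the question.
df = pd.DataFrame({'Experience': [0, 2, 4, 6, 10, 12]})
xp = df['Experience']

# Conditions are evaluated in order and the first True one wins.
conditions = [xp < 1, xp <= 3, xp <= 5, xp < 10]
labels = ['Rookie', 'Little experience', 'Experienced', 'Very experienced']

# Assumption: anything not matched above (10 or more) is a 'Veteran'.
df['XP'] = np.select(conditions, labels, default='Veteran')
```

This works on the whole column at once instead of calling a Python function per row, which is why it tends to be faster than apply on large frames.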

SimonR
  • Thanks Simon for the reply. I tried what you suggested and it gave me back a type error: TypeError: 'int' object is not iterable. I should have mentioned that the df.Experience column is an integer dtype. – Oscar Agbor Jun 26 '20 at 13:48
  • Restarted the kernel and got this error: TypeError: '<' not supported between instances of 'str' and 'int'. Any ideas? – Oscar Agbor Jun 26 '20 at 14:01
  • Hey Oscar, that could be because some of your input is a string; maybe my code solves this. – Martijniatus Jun 26 '20 at 14:26
import pandas as pd

df = pd.DataFrame({'xp': [0, 2, 4, 6, 20, '4']})

I put a string in the data because you had the type error.

def clean(str_xp):
    # Convert first, so values stored as strings (e.g. '4') compare correctly.
    xp = int(str_xp)
    if xp < 1:
        return 'Rookie'
    elif ((xp >= 1) and (xp <= 3)):
        return 'Little experience'
    elif ((xp >= 4) and (xp <= 5)):
        return 'Experienced'
    elif ((xp > 5) and (xp < 10)):
        return 'Very experienced'
    elif (xp > 10):
        return 'Veteran'

df['rank'] = df['xp'].apply(clean) 

df now contains:

   xp               rank
0   0             Rookie
1   2  Little experience
2   4        Experienced
3   6   Very experienced
4  20            Veteran
5   4        Experienced
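As an alternative to calling int() inside the function, the column can be converted once up front. This is only a sketch on the same toy data, and the use of `errors='coerce'` is a choice made here for illustration, not something the answer relies on.

```python
import pandas as pd

df = pd.DataFrame({'xp': [0, 2, 4, 6, 20, '4']})

# Convert the whole column in one step. errors='coerce' turns values that
# cannot be parsed into NaN instead of raising, so they can be inspected later.
df['xp'] = pd.to_numeric(df['xp'], errors='coerce')
```

After this, the column has a numeric dtype and comparisons such as xp < 1 no longer raise the str/int TypeError.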
Martijniatus
  • Thank you for your reply. I was wondering why, at index 5 of your output, the xp value is 4 when the string '5' was used in the initial list? Sorry, I am new to Python. I'd appreciate some elaboration if you could. Thanks again – Oscar Agbor Jun 26 '20 at 14:41
  • Edited it now; it's now the way it's supposed to be. – Martijniatus Jun 26 '20 at 14:46