0

I have dates for each row in my dataframe and want to assign a value to a new column based on a condition of the date.

Normally if I assign a value to a new column, I would do something like this:

def get_mean(df):
   return df.assign(
     grouped_mean = lambda df: df.groupby('group')['X']
       .transform(lambda df: df.mean())
   )

No I am looking for a solution like that, since the solution I have now is very slow and not beautiful.

Is there a better way than my current solution and use assign?

I currently came up with this solution:

def set_season(df):
    df = df.copy()
    for i in df.index:
        if (df.loc[i, 'Date'] >= pd.Timestamp('2008-08-30')) & (df.loc[i, 'Date'] <= pd.Timestamp('2009-05-31')):
            df.at[i, 'season'] = '08-09'
        elif  (df.loc[i, 'Date'] >= pd.Timestamp('2009-08-22')) & (df.loc[i, 'Date'] <= pd.Timestamp('2010-05-16')):
            df.at[i, 'season'] = '09-10'
        elif  (df.loc[i, 'Date'] >= pd.Timestamp('2010-08-28')) & (df.loc[i, 'Date'] <= pd.Timestamp('2011-05-22')):
            df.at[i, 'season'] = '10-11'

    return df
Shaido
  • 27,497
  • 23
  • 70
  • 73
Niels Hoogeveen
  • 365
  • 4
  • 19

2 Answers2

4

In pandas and in most cases in Python general, we want to avoid looping over our data because it can be slower up to factors 1000x. Pandas and numpy provide lot of vectorized solutions, for most of our problem cases. Read more about it here

In your case we can use np.select which lets us define multiple conditions and based on those conditions we define choices.

Plus we can make your code more elegant by using Series.between with the inclusive=True argument.

conditions = [
    df['Date'].between('2008-08-30', '2009-05-31', inclusive=True),
    df['Date'].between('2009-08-22', '2010-05-16', inclusive=True),
    df['Date'].between('2010-08-28', '2011-05-22', inclusive=True)
]

choices = ['08-09', '09-10', '10-11']

df['season'] = np.select(conditions, choices, default='99-99')

Sidenote

We can also rewrite your first function better with removing the two lambda functions and simply assigning the new column with groupby and transform and also taking extra arguments: group & mean_col

def get_mean(df, group, mean_col):

    df['mean'] = df.groupby(group)[mean_col].transform('mean')

    return df

Example

# Example dataframe
df = pd.DataFrame({'Fruit':['Banana', 'Strawberry', 'Apple', 'Banana', 'Apple'],
                   'Weight':[10, 12, 8, 9, 14]})

        Fruit  Weight
0      Banana      10
1  Strawberry      12
2       Apple       8
3      Banana       9
4       Apple      14

get_mean(df, 'Fruit', 'Weight')

        Fruit  Weight  mean
0      Banana      10   9.5
1  Strawberry      12  12.0
2       Apple       8  11.0
3      Banana       9   9.5
4       Apple      14  11.0
Erfan
  • 40,971
  • 8
  • 66
  • 78
  • 1
    beat me to it - `select` is the correct way to go about this. – asongtoruin May 31 '19 at 08:03
  • @Erfan. This is great for True cases, how about the 'False' cases. For eg, if it not between the first condition? Do we need to define a separate condition and choices for them? or is there a work around like np.where where you can provide choices for true and False conditions?Thanks. – Gopinath S May 18 '20 at 22:06
0

Use .apply() method, if the new column 'season' depends only on one column:

def your_function(date):
    """
    takes a date a returns a string season
    """
    # code your function here

df['season'] = df['Date'].apply(your_function)

if your new column 'season' depends on multiple other columns, use axis = 1:

def your_function(row):
    """
    takes a row from your dataframe and returns a result
    """
    # code your function here
    # example if you want a sum of col1, col2, col3
    return row['col1'] + row['col2'] + row['col3']

df['season'] = df.apply(your_function, axis = 1)
ichafai
  • 331
  • 2
  • 9