Grouping Pandas dataframe based on conditions?

Question

I am following the suggestions here pandas create new column based on values from other columns but still getting an error. Basically, my Pandas dataframe has many columns and I want to group the dataframe based on a new categorical column whose value depends on two existing columns (AMP, Time).

df
df['Time'] = pd.to_datetime(df['Time']) 
#making sure Time column read from the csv file is time object

import datetime as dt
day_1 = dt.date.today()
day_2 = dt.date.today() - dt.timedelta(days = 1)

def f(row):

    if (row['AMP'] > 100) & (row['Time'] > day_1):
        val = 'new_positives'

    elif (row['AMP'] > 100) & (day_2 <= row['Time'] <= day_1):
        val = 'rec_positives'

    elif (row['AMP'] > 100 & row['Time'] < day_2):
        val = 'old_positives'

    else:
        val = 'old_negatives'

    return val

df['GRP'] = df.apply(f, axis=1) #this gives the following error:
TypeError: ("Cannot compare type 'Timestamp' with type 'date'", 'occurred at index 0')

df[(df['AMP'] > 100) & (df['Time'] > day_1)]  #this works fine

df[(df['AMP'] > 100) & (day_2 <= df['Time'] <= day_1)]  #this works fine

df[(df['AMP'] > 100) & (df['Time'] < day_2)]  #this works fine


#df = df.groupby('GRP')

I am able to select the proper sub-dataframes based on the conditions specified above, but when I apply the above function on each row, I get the error. What is the correct approach to group the dataframe based on the conditions listed?

EDIT:

Unforunately, I cannot provide a sample of my dataframe. However, here is simple dataframe that gives an error of the same type:

import numpy as np
import pandas as pd
mydf = pd.DataFrame({'a':np.arange(10),
   'b':np.random.rand(10)})

def f1(row):
    if row['a'] < 5 & row['b'] < 0.5:
        value = 'less'
    elif row['a'] < 5 & row['b'] > 0.5:
        value = 'more'
    else:
        value = 'same'
    return value

mydf['GRP'] = mydf.apply(f1, axis=1)

ypeError: ("unsupported operand type(s) for &: 'int' and 'float'", 'occurred at index 0')

EDIT 2: As suggested below, enclosing the comparison operator with parentheses did the trick for the cooked up example. This problem is solved.

However, I am still getting the same error in my my real example. By the way, if I were to use the column 'AMP' with perhaps another column in my table, then everything works and I am able to create df['GRP'] by applying the function f to each row. This shows the problem is related to using df['Time']. But then why am I able to select df[(df['AMP'] > 100) & (df['Time'] > day_1)]? Why would this work in this context, but not when the condition appears in a function?

My actual dataframe has company confidential information and I can't share it per company policy. However, I did cook up a quick example which illustrates the same error type. Please see the EDIT. — Murchak, Oct 24 '18 at 00:37

Peter Leimbigler · Accepted Answer · 2018-10-24T12:13:12.880

2

Based on your error message and example, there are two things to fix. One is to adjust parentheses for operator precedence in your final elif statement. The other is to avoid mixing datetime.date and Timestamp objects.

Fix 1: change this:

elif (row['AMP'] > 100 & row['Time'] < day_2):

to this:

elif (row['AMP'] > 100) & (row['Time'] < day_2):

These two lines are different because the bitwise & operator takes precedence over the < and > comparison operators, so python attempts to evaluate 100 & row['Time']. A full list of Python operator precedence is here: https://docs.python.org/3/reference/expressions.html#operator-precedence

Fix 2: Change these 3 lines:

import datetime as dt
day_1 = dt.date.today()
day_2 = dt.date.today() - dt.timedelta(days = 1)

to these 2 lines:

day1 = pd.to_datetime('today')
day_2 = day_1 - pd.DateOffset(days=1)

edited Oct 24 '18 at 12:13

answered Oct 24 '18 at 00:44

Peter Leimbigler

10,775
1
23
37

Thanks! For the trivial example that I shown, adding the parentheses fixed the problem. However, the same didn't work for my actual example and I get the error: TypeError: ("Cannot compare type 'Timestamp' with type 'date'", 'occurred at index 0'). I am trying to understand what this error means. What does index 0 refer to? 1st row of data in the dataframe? – Murchak Oct 24 '18 at 01:19
Yes, index 0 refers to the first row in whatever Series or DataFrame is causing the error. What is your pandas version? (Check in python with `pd.__version__`) – Peter Leimbigler Oct 24 '18 at 01:26
pd.__version__ = 0.20.3 (python 3x) – Murchak Oct 24 '18 at 01:30
Gotcha. I realized that no matter what your version (0.23 is out by the way!), the fastest fix is this: change `day_1 = dt.date.today()` to `day1 = pd.to_datetime('today')`, and change `day_2 = dt.date.today() - dt.timedelta(days = 1)` to `day_2 = day_1 - pd.DateOffset(days=1)`. I've edited this into my answer. – Peter Leimbigler Oct 24 '18 at 01:33
If I were to pass day_1 as an argument to the function, e.g., f(row, day_1), instead of hard coding it, how would I modify df.apply(f, axis = 1) statement? Or maybe even f(row, day_1, day_2)... – Murchak Oct 24 '18 at 06:58
@Murchak, that should work with minimal changes to your code. You'd call the apply with something like `df.apply(row, day_1=pd.to_datetime('2018-02-17'), day_1=pd.to_datetime('2018-02-16'))`. Or did you mean somehow programmatically selecting `day_1` and `day_2` instead of manually specifying them? That is probably best asked as a new question. – Peter Leimbigler Oct 24 '18 at 12:31

score 1 · Answer 2 · answered Oct 24 '18 at 00:43

Some parentheses need to be added in the if-statements:

import numpy as np
import pandas as pd

mydf = pd.DataFrame({'a':np.arange(10),
   'b':np.random.rand(10)})

def f1(row):
    if (row['a'] < 5) & (row['b'] < 0.5):
        value = 'less'
    elif (row['a'] < 5) & (row['b'] > 0.5):
        value = 'more'
    else:
        value = 'same'
    return value

mydf['GRP'] = mydf.apply(f1, axis=1)

edesz · Answer 3 · 2018-10-24T01:48:27.907

If you don't need to use a custom function, then you can use multiple masks (somewhat similar to this SO post)

For the Time column, I used this code. It may be that you were trying to compare Time column values that did not have the required dtype (??? this is my guess)

import datetime as dt
mydf['Time'] = pd.date_range(start='10/14/2018', end=dt.date.today())
day_1 = pd.to_datetime(dt.date.today())
day_2 = day_1 - pd.DateOffset(days = 1)

Here is the raw data

mydf

   a         b       Time
0  0  0.550149 2018-10-14
1  1  0.889209 2018-10-15
2  2  0.845740 2018-10-16
3  3  0.340310 2018-10-17
4  4  0.613575 2018-10-18
5  5  0.229802 2018-10-19
6  6  0.013724 2018-10-20
7  7  0.810413 2018-10-21
8  8  0.897373 2018-10-22
9  9  0.175050 2018-10-23

One approach involves using masks for columns

# Append new column
mydf['GRP'] = 'same'
# Use masks to change values in new column
mydf.loc[(mydf['a'] < 5) & (mydf['b'] < 0.5) & (mydf['Time'] < day_2), 'GRP'] = 'less'
mydf.loc[(mydf['a'] < 5) & (mydf['b'] > 0.5) & (mydf['Time'] > day_1), 'GRP'] = 'more'
mydf

   a         b       Time   GRP
0  0  0.550149 2018-10-14  same
1  1  0.889209 2018-10-15  same
2  2  0.845740 2018-10-16  same
3  3  0.340310 2018-10-17  less
4  4  0.613575 2018-10-18  same
5  5  0.229802 2018-10-19  same
6  6  0.013724 2018-10-20  same
7  7  0.810413 2018-10-21  same
8  8  0.897373 2018-10-22  same
9  9  0.175050 2018-10-23  same

Another approach is to set a, b and Time as a multi-index and use index-based masks to set values

mydf.set_index(['a','b','Time'], inplace=True)

# Get Index level values
a = mydf.index.get_level_values('a')
b = mydf.index.get_level_values('b')
t = mydf.index.get_level_values('Time')

# Apply index-based masks
mydf['GRP'] = 'same'
mydf.loc[(a < 5) & (b < 0.5) & (t < day_2), 'GRP'] = 'less'
mydf.loc[(a < 5) & (b > 0.5) & (t > day_1), 'GRP'] = 'more'
mydf.reset_index(drop=False, inplace=True)
mydf

   a         b       Time   GRP
0  0  0.550149 2018-10-14  same
1  1  0.889209 2018-10-15  same
2  2  0.845740 2018-10-16  same
3  3  0.340310 2018-10-17  less
4  4  0.613575 2018-10-18  same
5  5  0.229802 2018-10-19  same
6  6  0.013724 2018-10-20  same
7  7  0.810413 2018-10-21  same
8  8  0.897373 2018-10-22  same
9  9  0.175050 2018-10-23  same

Source to filter by datetime and create a range of dates.

thanks. the df with columns 'a' and 'b' work just fine (with applying the function) now with the addition of parentheses as pointed out above. it is the real example that still gives the error. — Murchak, Oct 24 '18 at 01:31
I have edited my answer to include the `Time` column that you are referring to in the OP. Note: I had to change your definitions of `date_1` and `date_2` - I made some guesses for both of these variables. — edesz, Oct 24 '18 at 01:50

score 0 · Answer 4 · answered Oct 24 '18 at 12:29

You have a excelent example here, it is very useful and you could apply filters after groupby. It is a way without using mask.

def get_letter_type(letter):
   if letter.lower() in 'aeiou':
       return 'vowel'
   else:
       return 'consonant'


In [6]: grouped = df.groupby(get_letter_type, axis=1)

https://pandas.pydata.org/pandas-docs/version/0.22/groupby.html

Grouping Pandas dataframe based on conditions?

4 Answers4