For this pandas DataFrame (which in reality is much longer), I would like to get, for each day, the values of b and date in the rows where b is at its minimum and where b is at its maximum. Performance is an issue.

   b                date
0  1 1999-12-29 23:59:12
1  2 1999-12-29 23:59:13
2  3 1999-12-29 23:59:14
3  3 1999-12-30 23:59:12
4  1 1999-12-30 23:59:13
5  2 1999-12-30 23:59:14
6  2 1999-12-31 23:59:12
7  3 1999-12-31 23:59:13
8  1 1999-12-31 23:59:14

So I would like to get

   b                date
0  1 1999-12-29 23:59:12
2  3 1999-12-29 23:59:14

3  3 1999-12-30 23:59:12
4  1 1999-12-30 23:59:13

7  3 1999-12-31 23:59:13
8  1 1999-12-31 23:59:14

This is how the dataframe gets generated:

import datetime
import pandas as pd
df = pd.DataFrame({"a": ["29.12.1999 23:59:12",
                         "29.12.1999 23:59:13",
                         "29.12.1999 23:59:14",

                         "30.12.1999 23:59:12",
                         "30.12.1999 23:59:13",
                         "30.12.1999 23:59:14",

                         "31.12.1999 23:59:12",
                         "31.12.1999 23:59:13",
                         "31.12.1999 23:59:14"],
                   "b": [1,
                         2,
                         3,

                         3,
                         1,
                         2,

                         2,
                         3,
                         1]})
# the strings are day-first, so state the format explicitly
df["date"] = pd.to_datetime(df.a, format="%d.%m.%Y %H:%M:%S")
df.drop(["a"], axis=1, inplace=True)
user7468395
  • And what code or ideas have you come up with to handle this problem? – gosuto Jan 04 '19 at 21:43
  • SQL approach: two queries with (order by) + (limit 1) if you just need the min and max and nothing else. This is much better than fetching 10,000+ rows to get 2 values. – Dave S Jan 04 '19 at 21:55

3 Answers

First extract the calendar day from date, then sort by b using sort_values, and use drop_duplicates twice (once keeping the first row per day, once keeping the last) to get the rows with the daily minimum and maximum.

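df = df.assign(days=df.date.dt.date).sort_values('b')
# after sorting by b, the first row per day holds the min and the last row the max
yourdf = (pd.concat([df.drop_duplicates('days'),
                     df.drop_duplicates('days', keep='last')])
            .sort_index()
            .drop(columns='days'))  # drop('days', 1) no longer works in pandas 2.0
yourdf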
Out[242]: 
   b                date
0  1 1999-12-29 23:59:12
2  3 1999-12-29 23:59:14
3  3 1999-12-30 23:59:12
4  1 1999-12-30 23:59:13
7  3 1999-12-31 23:59:13
8  1 1999-12-31 23:59:14
BENY

This may not be the most performant approach, since it iterates over the dates, but:

df['true_date'] = df['date'].dt.date
date_min_max = df.groupby('true_date')['b'].agg(['min','max'])

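# DataFrame.append was removed in pandas 2.0, so collect the pieces and concat once
parts = []
for date, min_max_series in date_min_max.iterrows():
    # rows of this day whose b equals the day's min or max
    mask = (df['true_date'] == date) & df['b'].isin(min_max_series)
    parts.append(df.loc[mask, ['b', 'date']])
result = pd.concat(parts)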

Out[170]: 
   b                date
0  1 1999-12-29 23:59:12
2  3 1999-12-29 23:59:14
3  3 1999-12-30 23:59:12
4  1 1999-12-30 23:59:13
7  3 1999-12-31 23:59:13
8  1 1999-12-31 23:59:14
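
The iteration over dates can probably be avoided. A minimal sketch, assuming the same b and date columns as in the question: groupby(...).transform broadcasts each day's minimum and maximum back onto the rows, so a single boolean mask selects the matching rows. The names day, per_day, is_min and is_max are illustrative, not from the original answer.

```python
import pandas as pd

# Same sample data as in the question, written as ISO timestamps.
df = pd.DataFrame({
    "b": [1, 2, 3, 3, 1, 2, 2, 3, 1],
    "date": pd.to_datetime([
        "1999-12-29 23:59:12", "1999-12-29 23:59:13", "1999-12-29 23:59:14",
        "1999-12-30 23:59:12", "1999-12-30 23:59:13", "1999-12-30 23:59:14",
        "1999-12-31 23:59:12", "1999-12-31 23:59:13", "1999-12-31 23:59:14",
    ]),
})

# Midnight of each row's day, used only as the grouping key.
day = df["date"].dt.normalize()
per_day = df.groupby(day)["b"]

# transform returns one value per original row, aligned on df's index.
is_min = df["b"] == per_day.transform("min")
is_max = df["b"] == per_day.transform("max")

result = df[is_min | is_max]
print(result)
```

Like the loop above, this keeps every row that ties for a day's minimum or maximum. Whether it is actually faster on the real data would have to be measured.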
it's-yer-boy-chet
>>> dfg = df.set_index('date').groupby(pd.Grouper(freq='D'))

>>> df['dailyMin'] = df['date'].isin(dfg.idxmin()['b'])
>>> df['dailyMax'] = df['date'].isin(dfg.idxmax()['b'])

>>> df[df[['dailyMin', 'dailyMax']].any(axis=1)]

   b                date  dailyMin  dailyMax
0  1 1999-12-29 23:59:12      True     False
2  3 1999-12-29 23:59:14     False      True
3  3 1999-12-30 23:59:12     False      True
4  1 1999-12-30 23:59:13      True     False
7  3 1999-12-31 23:59:13     False      True
8  1 1999-12-31 23:59:14      True     False

This might not be the most efficient way to do this; I have my doubts about .isin(). It also partly depends on your dataset. See this discussion: "Select the max row per group - pandas performance issue".
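
One way to sidestep .isin() is to keep the original integer index, so that idxmin and idxmax return row labels that can be passed straight to .loc. This is a sketch under the assumption that the index is unique, as in the question; by_day and rows are illustrative names.

```python
import numpy as np
import pandas as pd

# Same sample data as in the question, written as ISO timestamps.
df = pd.DataFrame({
    "b": [1, 2, 3, 3, 1, 2, 2, 3, 1],
    "date": pd.to_datetime([
        "1999-12-29 23:59:12", "1999-12-29 23:59:13", "1999-12-29 23:59:14",
        "1999-12-30 23:59:12", "1999-12-30 23:59:13", "1999-12-30 23:59:14",
        "1999-12-31 23:59:12", "1999-12-31 23:59:13", "1999-12-31 23:59:14",
    ]),
})

# Group b by calendar day; the index of df is left untouched.
by_day = df.groupby(df["date"].dt.normalize())["b"]

# idxmin/idxmax give the index label of the first min/max row per day.
# union1d sorts the labels and removes duplicates (a day with one row
# would otherwise appear twice).
rows = np.union1d(by_day.idxmin(), by_day.idxmax())

result = df.loc[rows]
print(result)
```

Unlike a mask on the values, this returns only the first row per day when several rows tie for the minimum or maximum.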

gosuto