0


I have a large dataframe (around 35k entries), the index of this data frame is composed by dates (like 2014-02-12), this dates are not unique. What I need to do is to find for each data what is the max value for each data and create a new data frame with it. I created a solution that works (it is down bellow) but it takes a lot of time to process. Does anyone knows a faster way that I could do this? Thank you.

#Creates a empty dataframe
dataset0514maxrec = pd.DataFrame(columns=dataset0514max.columns.values)
dataset0514maxrec.index.name = 'Date'

#Gets the unique values, find the groups, recover the max value and append it
for i in dataset0514max.index.unique():
    tempDF1 = dataset0514max.loc[dataset0514max.index.isin([i])]
    tempDF2 = tempDF1[tempDF1['Data_Value'] == tempDF1['Data_Value'].max()]
    dataset0514maxrec = dataset0514maxrec.append(tempDF2.head(1))

print(dataset0514maxrec)
cs95
  • 379,657
  • 97
  • 704
  • 746
Filipe Lemos
  • 500
  • 3
  • 13

1 Answers1

4

groupby with levels

df.groupby(level=0).Data_Value.max().reset_index()

The next two options require the index to be a datetime index. If it isn't, convert it:

df.index = pd.to_datetime(df.index) 

resample

df.resample('D').max()

sort_values + duplicated

df = df.sort_values('Data_Value')
m = ~df.index.duplicated()
df = df[m]
cs95
  • 379,657
  • 97
  • 704
  • 746
  • It worked so perfectly, just wanted to know something. what does `level=0` means? – Filipe Lemos Nov 20 '17 at 00:41
  • 1
    @FilipeLemos It groups by the index (`level=0` implies that it should take the index. Usually, index has only one level -> level 0. Multiindex has more levels, in which case, you specify level = 0, level = 1, or similar.) – cs95 Nov 20 '17 at 00:42
  • 1
    `df.resample('D').max()` ... But requires that daily is ok – piRSquared Nov 20 '17 at 00:43
  • @FilipeLemos Please try option 2 and option 3 and let me know if those work as well. – cs95 Nov 20 '17 at 00:44
  • @cᴏʟᴅsᴘᴇᴇᴅ, trying to learn the difference between resample and asfreq, any ideas? – Vaishali Nov 20 '17 at 00:56
  • @Vaishali Hah! Speaking of, an exact such question was once asked :D Here it is: https://stackoverflow.com/questions/18060619/difference-between-asfreq-and-resample – cs95 Nov 20 '17 at 00:57
  • I read that before asking this question. Not much there, is it?:) – Vaishali Nov 20 '17 at 00:59
  • 1
    @Vaishali Yeah, pandas is guilty of having some underwhelming documentation sometimes. Another example is their json_normalize function. Very useful, yet very esoteric. – cs95 Nov 20 '17 at 01:01
  • One more option `df.reset_index().sort_values('Data_Value').drop_duplicates('index')` – BENY Nov 20 '17 at 02:28
  • @Wen Nice one! I improved upon it slightly using `duplicated` so that reset_index is not required. – cs95 Nov 20 '17 at 03:20
  • Ok `dataset2015max.resample('D').max()` worked perfectly but the `asfreq` returned a error `ValueError: cannot reindex from a duplicate axis`. The `resample` solution returns a **Dataframe** and the `groupby` returns a **Series**. – Filipe Lemos Nov 20 '17 at 03:32