Remove rows conditionally from start of pandas dataframe

Question

I have some data in chronological order. The index is a date time with minute-level resolution. I store the hour in a column called hour and the minute in a column called minute. I want to trim the start of the data so that I always begin with 00:00. The incoming dataset may begin with some random minute of the day. The data consists of minute-level rows for many days (1000s). So losing part of the first day is not an issue. I just need the data to start at midnight.

I am trying to use the following code to trim my data frame so that it always begins with 00:00.

def clean_start_data(df):
    for index, row in df.iterrows():
        if row['hour'] > 0 or row['minute'] > 0:
            df.drop(index, inplace=True)
        else:
            break
    return df

But I get stuck and my kernel becomes unresponsive.

What am I doing wrong?

EDIT

My data looks like this

h = 9 m = 0 data = blah
h = 9 m = 1 data = blahhbadf
h = 9 m = 2 data = somethning_else
....
h = 0 m = 0 data = something. // new day...I want to start here and remove all rows above

The data covers around 400 days. At h=23 m=59, the h goes back to 0 and minute goes back to 0.

I want to remove the rows that occur before the first new day starts, i.e. I want my data to start at h = 0 m = 0.

score 1 · Accepted Answer · answered May 08 '18 at 01:17

I think this is just a simple Boolean filter.

df.loc[(df.hour==0)|(df.minute==0)]

To fix your code

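def clean_start_data(df):
    # Collect the labels of the leading rows and drop them in one call,
    # instead of calling drop() once per row inside the loop.
    to_drop = []
    for index, row in df.iterrows():
        if row['hour'] > 0 or row['minute'] > 0:
            to_drop.append(index)
        else:
            break  # first 00:00 row reached, stop scanning
    # Without inplace=True, drop() returns the trimmed frame
    # (with inplace=True it returns None).
    return df.drop(to_drop)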
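
One detail worth checking when a function returns the result of `drop`: with `inplace=True`, `DataFrame.drop` modifies the frame and returns `None`, so the caller only gets a frame back if `inplace` is left off. A minimal sketch, using a hypothetical three-row frame that is not part of the original data:

```python
import pandas as pd

# Hypothetical three-row frame, used only for illustration.
df = pd.DataFrame({"hour": [23, 23, 0], "minute": [58, 59, 0]})

# Without inplace, drop() leaves df untouched and returns a new frame.
returned = df.drop([0, 1])

# With inplace=True, drop() modifies df itself and returns None.
in_place = df.drop([0, 1], inplace=True)

print(returned)   # the single 00:00 row
print(in_place)   # None
```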

answered May 08 '18 at 01:17

BENY


Hi, I tried using the simple boolean filter to no avail. I have edited my original question to try to make things clearer. – BYZZav May 08 '18 at 15:27
The important point is that I want to remove everything until the first time we hit 00:00, and then I want the function to stop. I don't want to remove any more entries. All I am doing is ensuring the array starts at 00:00. The boolean filter looks through the whole dataset. – BYZZav May 08 '18 at 15:33
@Vazzyb you can do `df.loc[((df.hour==0)|(df.minute==0)).idxmax():]` – BENY May 08 '18 at 15:35
what works is df[df[(df.hour == 0)&(df.minute == 0)].index[0]:] – BYZZav May 08 '18 at 19:42
@Vazzyb your solution only works if just the first row is not 00:00; think about what happens if more than one row at the head is not 00:00 – BENY May 08 '18 at 19:49
in this case, my code will just return the index of the first row... – BYZZav May 08 '18 at 19:55
also, it's definitely an &, not an OR. – BYZZav May 08 '18 at 19:55
@Vazzyb df.loc[((df.hour==0)&(df.minute==0)).idxmax():] this one ? – BENY May 08 '18 at 19:57
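
For reference, the `&` plus `idxmax()` slice from the last comment can be tried on a small frame. This is only a sketch under assumptions: the sample data, the `"min"` frequency, and the helper name `trim_to_first_midnight` are made up for illustration, and the guard for "no 00:00 row at all" is an addition, since `idxmax()` on an all-False mask returns the first label.

```python
import pandas as pd

# Hypothetical sample: minute-level rows starting at 23:58 and
# running past two midnights (1445 rows in total).
idx = pd.date_range("2018-05-07 23:58", periods=1445, freq="min")
df = pd.DataFrame({"hour": idx.hour, "minute": idx.minute}, index=idx)


def trim_to_first_midnight(frame):
    """Return the frame starting at its first 00:00 row (hypothetical helper)."""
    is_midnight = (frame["hour"] == 0) & (frame["minute"] == 0)
    if not is_midnight.any():
        # No 00:00 row found: return an empty frame rather than guessing.
        return frame.iloc[0:0]
    # idxmax() gives the label of the first True in the mask.
    first_midnight = is_midnight.idxmax()
    # Label-based slice: keeps everything from that row onwards.
    return frame.loc[first_midnight:]


trimmed = trim_to_first_midnight(df)

# A plain boolean filter, by contrast, keeps only the 00:00 rows
# from the whole dataset instead of trimming the head.
is_midnight = (df["hour"] == 0) & (df["minute"] == 0)
only_midnights = df[is_midnight]

print(trimmed.index[0])      # first remaining timestamp
print(len(only_midnights))   # number of 00:00 rows in the sample
```

The slice only removes the rows before the first midnight, which matches the requirement that nothing after that point is touched.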
