I have some data in chronological order. The index is a datetime with minute-level resolution. I store the hour in a column called hour and the minute in a column called minute. I want to trim the start of the data so that it always begins at 00:00. The incoming dataset may begin at some arbitrary minute of the day. The data consists of minute-level rows for many days (1000s), so losing part of the first day is not an issue. I just need the data to start at midnight.

I am trying to use the following code to trim my data frame so that it always begins with 00:00.

def clean_start_data(df):
    for index, row in df.iterrows():
        if row['hour'] > 0 or row['minute'] > 0:
            df.drop(index, inplace=True)
        else:
            break
    return df

But it gets stuck and my kernel becomes unresponsive.

What am I doing wrong?

EDIT

My data looks like this

h = 9 m = 0 data = blah
h = 9 m = 1 data = blahhbadf
h = 9 m = 2 data = somethning_else
....
h = 0 m = 0 data = something. // new day...I want to start here and remove all rows above

The data covers around 400 days. At h=23 m=59, the h goes back to 0 and minute goes back to 0.

I want to remove the time entries that occur before the first new day starts, i.e. I want my data to start at h = 0, m = 0.
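For anyone who wants to reproduce this, here is a minimal sketch of data shaped like the description above. The start date, the three-day length, and the names `idx` and `sample` are made up for illustration; only the `hour` and `minute` columns and the minute-level datetime index come from the question.

```python
import pandas as pd

# Hypothetical sample: minute-level rows that start at 09:00 and run for three days.
idx = pd.date_range("2018-05-01 09:00", periods=3 * 24 * 60, freq="min")
sample = pd.DataFrame(
    {"hour": idx.hour, "minute": idx.minute, "data": range(len(idx))},
    index=idx,
)

# Rows 0..899 (09:00 through 23:59) are the ones that should be trimmed away.
is_midnight = (sample["hour"] == 0) & (sample["minute"] == 0)
```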

BYZZav

1 Answer

I think this is just a simple Boolean filter.

df.loc[(df.hour==0)|(df.minute==0)]

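To fix your code, collect the indices to drop and drop them once after the loop:

def clean_start_data(df):
    to_drop = []
    for index, row in df.iterrows():
        # Collect every leading row until the first 00:00 row.
        if row['hour'] > 0 or row['minute'] > 0:
            to_drop.append(index)
        else:
            break
    # drop(..., inplace=True) returns None, so return the new frame instead.
    return df.drop(to_drop)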
BENY
  • Hi I tried using the simple boolean filter to no avail. I have edited my original question in order to try and make things clearer. – BYZZav May 08 '18 at 15:27
  • The important point is that I want to remove everything until the first time we hit 00:00, and then I want the function to stop. I don't want to remove any more entries. All I am doing is ensuring the array starts at 00:00. The boolean filter looks through the whole dataset. – BYZZav May 08 '18 at 15:33
  • @Vazzyb you can do `df.loc[((df.hour==0)|(df.minute==0)).idxmax():]` – BENY May 08 '18 at 15:35
  • what works is df[df[(df.hour == 0)&(df.minute == 0)].index[0]:] – BYZZav May 08 '18 at 19:42
  • @Vazzyb your solution only works when just the first row is not 00:00; think about the case where more than one row at the head is not 00:00 – BENY May 08 '18 at 19:49
  • in this case, my code will just return the index of the first row... – BYZZav May 08 '18 at 19:55
  • also, it's definitely an &, not an OR. – BYZZav May 08 '18 at 19:55
  • @Vazzyb df.loc[((df.hour==0)&(df.minute==0)).idxmax():] this one ? – BENY May 08 '18 at 19:57
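The approach the thread converges on (find the first row where hour and minute are both 0, then keep everything from there) could be sketched as below. This is a positional variant, not the exact one-liner from the comments: it uses NumPy's `argmax` with `iloc` instead of `idxmax` with `loc`, and it adds a guard because both `idxmax` and `argmax` point at the first row when no 00:00 row exists. The function name `trim_to_first_midnight` and the `demo` frame are hypothetical.

```python
import pandas as pd

def trim_to_first_midnight(df):
    """Return df starting at the first row where hour == 0 and minute == 0."""
    is_midnight = ((df["hour"] == 0) & (df["minute"] == 0)).to_numpy()
    if not is_midnight.any():
        # No 00:00 row at all: return an empty frame rather than the full data.
        return df.iloc[0:0]
    first = is_midnight.argmax()  # position of the first True
    return df.iloc[first:]

# Hypothetical demo data: 23:58, 23:59, 00:00, 00:01, 00:02.
idx = pd.date_range("2018-05-01 23:58", periods=5, freq="min")
demo = pd.DataFrame({"hour": idx.hour, "minute": idx.minute}, index=idx)
trimmed = trim_to_first_midnight(demo)
```

Because the mask is computed once for the whole column, this avoids the row-by-row `iterrows` loop and leaves later 00:00 rows untouched.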