
I am trying to divide a huge log dataset containing StartTime, EndTime and other fields. I am using np.where to compare a pandas DataFrame column and split the data into hourly (or half-hour, or quarter-hour) chunks, depending on the hr and timeWindow values.

Below, I am trying to divide an entire day of logs into 1-hour chunks, but it does not give me the expected output.

I am out of ideas as to where exactly my mistake is!

# Take the very first timestamp in the log data and strip off
# the minutes, seconds and microseconds.
today = datetime.strptime(log_start_time, "%Y-%m-%d %H:%M:%S.%f").replace(minute=0, second=0, microsecond=0)
today_ts = int(time.mktime(today.timetuple()) * 1e9)  # epoch nanoseconds
hr = 1
timeWindow = int(hr * 60 * 60 * 1e9)  # hours * minutes * seconds * nanoseconds

parts = [df.loc[np.where((df["StartTime"] >= today_ts + i * timeWindow)
                         & (df["StartTime"] < today_ts + (i + 1) * timeWindow))]
           .dropna(axis=0, how='any')
         for i in range(rngCounter)]
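For reference, the same windowing logic can be reproduced on synthetic data (the epoch-nanosecond dtype of StartTime is an assumption here, since the real frame is not shown; plain boolean masks are used, as np.where is not strictly needed inside .loc):

```python
import numpy as np
import pandas as pd

# Synthetic log data: one entry every 20 minutes, StartTime stored as
# epoch nanoseconds (assumed dtype).
start = pd.Timestamp("2018-01-05 00:00:00")
times = start + pd.to_timedelta(np.arange(0, 180, 20), unit="m")
df = pd.DataFrame({"StartTime": times.asi8})

today_ts = int(start.value)          # epoch nanoseconds of midnight
timeWindow = int(1 * 60 * 60 * 1e9)  # one hour in nanoseconds
rngCounter = 3

# One sub-frame per hour window; each keeps every row whose StartTime
# falls inside [today_ts + i*hour, today_ts + (i+1)*hour).
parts = [df[(df["StartTime"] >= today_ts + i * timeWindow)
            & (df["StartTime"] < today_ts + (i + 1) * timeWindow)]
         for i in range(rngCounter)]
```

Note that the first row of each chunk is simply the earliest log entry falling inside that hour, which need not sit exactly on the hour boundary.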

If I check the first log entry inside each of my parts, it is something like below:

  1. 00:00:00.
  2. 00:43:23.
  3. 01:12:59.
  4. 01:53:55.
  5. 02:23:52.
  6. ....

Whereas I expect the output to be like below:

  1. 00:00:00
  2. 01:00:01
  3. 02:00:00
  4. 03:00:00
  5. 04:00:01
  6. ....

Though I have implemented it in an alternative way, that is a workaround, and I lose a few features by not having it like this.

Can someone please figure out what exactly is wrong with this approach?

Note: I am using a Python notebook with pandas and numpy.

Jyotirmay
    Can you please provide some example data? – pansen Jan 05 '18 at 12:46
  • 3
    I'm not sure you need `np.where` at all in `.loc` here. In what way are you not getting expected output? – roganjosh Jan 05 '18 at 12:46
  • 1
    Also, intuitively I think this would be better achieved by something like [`pandas.Grouper`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Grouper.html) with a time period, rather than some list comprehension like this. But we have nothing to test with. Please see [How to make good reproducible pandas examples](https://stackoverflow.com/q/20109391/4799172) – roganjosh Jan 05 '18 at 12:55

1 Answer


Thanks to @roganjosh.

I solved the problem using pandas Grouper.

Below is my implementation. The value of 'hr' is dynamic.

timeWindow = str(hr) + 'H'  # e.g. hr = 1 -> '1H'
# Dividing the log into "n" parts, depending on the timeWindow value.
# Copy StartTime so it survives as a column after becoming the index.
df["ST"] = df['StartTime']
df = df.set_index(['ST'])
# Use the copied column as a DatetimeIndex.
df.index = pd.to_datetime(df.index)
# Each element of parts is a (window start, sub-frame) pair.
# Note: pd.TimeGrouper is removed in modern pandas; use pd.Grouper instead,
# and pass the column selection as a list.
parts = list(df.groupby(pd.Grouper(freq=timeWindow))[["StartTime", "ProcessTime", "EndTime"]])
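A self-contained sketch of the same grouping, with made-up data and only StartTime and EndTime columns (pd.Grouper replaces the now-removed pd.TimeGrouper):

```python
import pandas as pd

# Made-up log entries spread over three different hours.
df = pd.DataFrame({
    "StartTime": pd.to_datetime([
        "2018-01-05 00:00:00", "2018-01-05 00:43:23",
        "2018-01-05 01:12:59", "2018-01-05 02:23:52",
    ]),
    "EndTime": pd.to_datetime([
        "2018-01-05 00:10:00", "2018-01-05 00:50:00",
        "2018-01-05 01:30:00", "2018-01-05 02:40:00",
    ]),
})

hr = 1
timeWindow = str(hr) + "H"
# Group on a datetime index built from StartTime; each element of parts
# is a (window start, sub-frame) pair.
df = df.set_index(pd.DatetimeIndex(df["StartTime"]))
parts = list(df.groupby(pd.Grouper(freq=timeWindow))[["StartTime", "EndTime"]])
```

Rows falling in the same clock hour end up in the same group, so the hour with two entries above yields a two-row sub-frame.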