I have the following problem: I am trying to convert a data frame (loaded from a CSV file with millions of lines) with this structure:
| start | end | type | value |
|---------------------|---------------------|------|-------|
| 2016-01-01 00:00:00 | 2016-01-02 00:00:00 | 0 | 200 |
| 2016-01-02 01:00:00 | 2016-01-03 00:00:00 | 1 | 100 |
| 2016-01-15 08:00:00 | 2016-01-16 07:00:00 | 0 | 15 |
| 2016-01-16 07:00:00 | 2016-01-16 07:00:00 | 2 | 80 |
And I would like to convert it into a structure with this format:
| timestamp | 0 | 1 | 2 |
|---------------------|-----|-----|---|
| 2016-01-01 00:00:00 | 200 | 0 | 0 |
| ... | 200 | 0 | 0 |
| 2016-01-02 00:00:00 | 200 | 0 | 0 |
| 2016-01-02 01:00:00 | 0 | 100 | 0 |
| ... | 0 | 100 | 0 |
| 2016-01-03 00:00:00 | 0 | 100 | 0 |
| ... | 0 | 0 | 0 |
| 2016-01-15 08:00:00 | 15 | 0 | 0 |
In other words, while the first table specifies the start and the end of an event of type
N together with its value
, I would like to end up with a table indexed by a range of timestamps, giving for each timestamp the values of all the events active at that time.
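To make the target index concrete, the timestamps I have in mind are simply an hourly range (so, for example, the first event row would cover 25 hourly slots, since both endpoints are included):

```python
import pandas as pd

# Hourly timestamps spanning the first event, endpoints inclusive
timestamps = pd.date_range('2016-01-01 00:00:00', '2016-01-02 00:00:00', freq='h')
print(len(timestamps))  # 25 hourly slots, each of which should hold 200 in column 0
```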
I am trying to find an efficient solution, and the best I have found so far converts each datetime to an integer (the number of hours since a base date) and then uses that value as an index into a NumPy
array. Unfortunately, my code relies on a for loop, and I was wondering if you could come up with something better.
```python
import pandas as pd
import numpy as np

# Example data frame ('id' is the event type, 'x' its value)
df = pd.DataFrame({'start': ['2016-01-01 00:00:00', '2016-01-02 01:00:00', '2016-01-15 08:00:00', '2016-01-16 07:00:00'],
                   'end': ['2016-01-02 00:00:00', '2016-01-03 00:00:00', '2016-01-16 07:00:00', '2016-01-16 07:00:00'],
                   'id': [0, 1, 0, 2],
                   'x': [200, 100, 15, 80]})

# Convert the strings into datetimes
df['start'] = pd.to_datetime(df['start'], format='%Y-%m-%d %H:%M:%S')
df['end'] = pd.to_datetime(df['end'], format='%Y-%m-%d %H:%M:%S')

# Offset in seconds of the first datetime I have
OFFSET = pd.Timestamp(2016, 1, 1, 0, 0, 0).timestamp()

# Convert the dates into integers (nanoseconds -> seconds -> hours since OFFSET)
df['start'] = ((df['start'].astype('int64') / 1e9 - OFFSET) / 3600).astype(np.int32)
df['end'] = ((df['end'].astype('int64') / 1e9 - OFFSET) / 3600).astype(np.int32)

# Target data structure: one row per hourly time stamp, one column per event type
x = np.zeros((1000, 3))

# Put the data into the target structure (the end hour is inclusive)
for i in range(len(df)):
    x[df.iloc[i].start:df.iloc[i].end + 1, df.iloc[i].id] = df.iloc[i].x
```
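In case it helps, here is a sketch of the direction I imagine a loop-free version of the same idea could take, using `np.repeat` and a cumulative-sum trick to expand every `[start, end]` interval into its individual hours (I haven't convinced myself it is correct or idiomatic, so corrections are welcome):

```python
import numpy as np
import pandas as pd

# Same example data as above
df = pd.DataFrame({'start': pd.to_datetime(['2016-01-01 00:00:00', '2016-01-02 01:00:00',
                                            '2016-01-15 08:00:00', '2016-01-16 07:00:00']),
                   'end': pd.to_datetime(['2016-01-02 00:00:00', '2016-01-03 00:00:00',
                                          '2016-01-16 07:00:00', '2016-01-16 07:00:00']),
                   'id': [0, 1, 0, 2],
                   'x': [200, 100, 15, 80]})

# Hours since the first datetime, computed directly on the timedeltas
origin = pd.Timestamp('2016-01-01 00:00:00')
start = ((df['start'] - origin) // pd.Timedelta('1h')).to_numpy()
end = ((df['end'] - origin) // pd.Timedelta('1h')).to_numpy()

# One row per hourly slot (end hour inclusive), one column per event type
out = np.zeros((end.max() + 1, df['id'].max() + 1))

# Expand every interval into its hours without a Python-level loop:
# each row is repeated once per hour it covers, and subtracting the
# flat offsets turns the repeats into consecutive hour indices.
lengths = end - start + 1
offsets = np.cumsum(lengths) - lengths          # flat position where each row begins
hours = np.repeat(start - offsets, lengths) + np.arange(lengths.sum())
cols = np.repeat(df['id'].to_numpy(), lengths)
vals = np.repeat(df['x'].to_numpy(), lengths)
out[hours, cols] = vals
```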
The conversion from datetime to integer was based on this SO question. My experience in Python is limited (I am mostly an R user), so I hope there is a better (vectorised?) and more elegant solution.
Thank you in advance!