3

I have a data frame (namend df) from 2016/1/1 00:00 until 2018/11/25 23:00 with a timestamp every hour, object_id and a value. The data set only contains rows where an object_id has a value.

timestampHour     object_id  value
2016/1/1 00:00    1          2
2016/1/1 00:00    3          1
2016/1/1 01:00    1          1
2016/1/1 01:00    2          3
2016/1/1 02:00    2          3
2016/1/1 02:00    3          2

I would like to get a dataframe showing all object id's for every hour, with a null value if there is no value.

timestampHour     object_id  value
2016/1/1 00:00    1          2
2016/1/1 00:00    2          null
2016/1/1 00:00    3          1
2016/1/1 01:00    1          1
2016/1/1 01:00    2          3
2016/1/1 01:00    3          null
2016/1/1 02:00    1          null
2016/1/1 02:00    2          3
2016/1/1 02:00    3          2

I have created the dateTime from timestamps. And rounded them to hours with the following code:

df["timestamp"] = pd.to_datetime(df["result_timestamp"])
df['timestampHour'] = df['result_timestamp'].dt.round('60min')

(I don't know if there are better options, but I have been trying to create timestampHour rows until 12 (I have 12 every unique object_id) and fill those newly created rows with (the for that hour) unused object_id. But I have not been able to create the empty rows, with the condition)

I am fairly new to programming and I am not finding a clue to get closer to solving this problem from searching other posts.

Brohm
  • 143
  • 1
  • 5

2 Answers2

3

Using pivot_table and unstack:

df.pivot_table(
    index='object_id',  columns='timestampHour', values='value'
).unstack().rename('value').reset_index()

    timestampHour  object_id  value
0  2016/1/1 00:00          1    2.0
1  2016/1/1 00:00          2    NaN
2  2016/1/1 00:00          3    1.0
3  2016/1/1 01:00          1    1.0
4  2016/1/1 01:00          2    3.0
5  2016/1/1 01:00          3    NaN
6  2016/1/1 02:00          1    NaN
7  2016/1/1 02:00          2    3.0
8  2016/1/1 02:00          3    2.0

To see why this works, the intermediate pivot_table is helpful to look at:

timestampHour  2016/1/1 00:00  2016/1/1 01:00  2016/1/1 02:00
object_id
1                         2.0             1.0             NaN
2                         NaN             3.0             3.0
3                         1.0             NaN             2.0

Where a value is not found for a combination of object_id and timestampHour, a NaN is added to the table. When you use unstack, these NaN's are kept, giving you the desired result with missing values represented.

user3483203
  • 50,081
  • 9
  • 65
  • 94
0

This is also .reindex with a cartesian product of the two levels. This question goes into detail on ways to optimize the performance of the product for large datasets.

import pandas as pd

id_cols = ['timestampHour', 'object_id']
idx = pd.MultiIndex.from_product(df[id_cols].apply(pd.Series.unique).values.T, names=id_cols)

df.set_index(id_cols).reindex(idx).reset_index()

Output:

    timestampHour  object_id  value
0  2016/1/1 00:00          1    2.0
1  2016/1/1 00:00          3    1.0
2  2016/1/1 00:00          2    NaN
3  2016/1/1 01:00          1    1.0
4  2016/1/1 01:00          3    NaN
5  2016/1/1 01:00          2    3.0
6  2016/1/1 02:00          1    NaN
7  2016/1/1 02:00          3    2.0
8  2016/1/1 02:00          2    3.0
ALollz
  • 57,915
  • 7
  • 66
  • 89