
I have a dataframe with multiindexing and a lot of rows. The indices are 'item' and 'TimeStamp'

Each item has a different number of rows, because rows whose values were NaN were dropped from the dataset. I would like to regenerate the lost rows and obtain a new dataframe as described below.

Ideally I would like to:

  1. create a new dataframe with the full DateTime index with a step of 10 minutes. Its size would be (full DateTimeIndex x number of items)
  2. each column would contain the data for a separate item, with NaN in the rows where data is missing. The column names would refer to item numbers ('I01', 'I02', ... etc.)

This way I would remove the multiindexing and be able to perform quicker operations on a 2D df.
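The steps above can be sketched directly with `unstack` and `reindex` (a minimal sketch on a hypothetical toy frame; the names `wide` and `full_index` are illustrative, not from the original data):

```python
import pandas as pd

# Hypothetical miniature version of the multi-indexed frame described above;
# the 00:10 row for 'I01' is deliberately missing.
idx = pd.MultiIndex.from_tuples(
    [('I01', pd.Timestamp('2011-09-20 00:00:00')),
     ('I01', pd.Timestamp('2011-09-20 00:20:00')),
     ('I02', pd.Timestamp('2011-09-20 00:00:00'))],
    names=['item', 'TimeStamp'])
df = pd.DataFrame({'value': [-11.28, -11.96, 30.0]}, index=idx)

# One column per item ('I01', 'I02', ...), indexed by timestamp
wide = df['value'].unstack('item')

# Reindex onto the full 10-minute grid; missing rows become NaN
full_index = pd.date_range(df.index.get_level_values('TimeStamp').min(),
                           df.index.get_level_values('TimeStamp').max(),
                           freq='10min')
wide = wide.reindex(full_index)
```

The result is the 2D frame described in the question: one column per item, one row per 10-minute step, NaN wherever a row was erased.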

The df I have is as follows:

                                 value
item TimeStamp                             
I01     2011-09-20 00:00:00         -11.280400
        2011-09-20 00:10:00         -11.945430
        2011-09-20 00:20:00         -11.962580
        2011-09-20 00:30:00         -12.074700
        2011-09-20 00:40:00         -11.923750
                                       ...
I07     2014-05-31 23:20:00         985.375427
        2014-05-31 23:30:00         951.776611
        2014-05-31 23:40:00         822.368286
        2014-05-31 23:50:00         879.974792
        2014-06-01 00:00:00         587.804321

[nevermind how many rows x 1 columns]

I will be really grateful for any help with this. I am quite new to Python.

mats_snaps

1 Answer


You can resolve this issue by ensuring that the same timestamps are used for each item group. This can be done by pivoting, filling in dummy values and, finally, unpivoting your data.

Note that, in the example below, the timestamp 2011-09-20 00:20:00 is missing for item I02. Our goal is to recreate this timestamp and align the timestamps of all items in the dataset.

import pandas as pd

data = [['I01', '2011-09-20 00:00:00', 10], ['I01', '2011-09-20 00:10:00', 20],
        ['I01', '2011-09-20 00:20:00', 20], ['I02', '2011-09-20 00:00:00', 30],
        ['I02', '2011-09-20 00:10:00', 40]]

df = pd.DataFrame(data, columns=['Item', 'Timestamp', 'Value'])


  Item            Timestamp  Value
0  I01  2011-09-20 00:00:00     10
1  I01  2011-09-20 00:10:00     20
2  I01  2011-09-20 00:20:00     20
3  I02  2011-09-20 00:00:00     30
4  I02  2011-09-20 00:10:00     40

To achieve this, we pivot the table so that the timestamps become columns (passing `values='Value'` keeps the columns flat, which `melt` needs later):

df = df.pivot_table(index='Item', columns='Timestamp', values='Value')

Timestamp  2011-09-20 00:00:00  2011-09-20 00:10:00  2011-09-20 00:20:00
Item
I01                       10.0                 20.0                 20.0
I02                       30.0                 40.0                  NaN

# Note that a NaN value appears for 'I02' at timestamp '2011-09-20 00:20:00'

Now, we fill in the NaNs with a sentinel value (the string 'dummy'). This keeps the restored rows from disappearing once we unpivot; afterwards we swap the sentinel for a real value (here the float 0.0).

Finally, we unpivot (i.e. melt) the table to retrieve the old data structure:

df = df.fillna('dummy').reset_index()
df = df.melt(id_vars='Item')
df = df.replace('dummy', 0.)

  Item             Timestamp  value
0  I01   2011-09-20 00:00:00   10.0
1  I02   2011-09-20 00:00:00   30.0
2  I01   2011-09-20 00:10:00   20.0
3  I02   2011-09-20 00:10:00   40.0
4  I01   2011-09-20 00:20:00   20.0
5  I02   2011-09-20 00:20:00    0.0

Now the same timestamps are used for every item group.

Edit: I also found a one-liner for this issue:

df.set_index(['Timestamp', 'Item']).unstack().fillna('dummy').stack().replace('dummy', 0.)
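Depending on your pandas version, `stack` also accepts `dropna=False`, which keeps the recreated NaN rows directly and lets you skip the dummy round trip entirely (a sketch under that assumption, reusing the example data from above):

```python
import pandas as pd

data = [['I01', '2011-09-20 00:00:00', 10], ['I01', '2011-09-20 00:10:00', 20],
        ['I01', '2011-09-20 00:20:00', 20], ['I02', '2011-09-20 00:00:00', 30],
        ['I02', '2011-09-20 00:10:00', 40]]
df = pd.DataFrame(data, columns=['Item', 'Timestamp', 'Value'])

# unstack aligns the timestamps across items; dropna=False keeps the
# recreated (Timestamp, Item) row instead of silently dropping its NaN
out = (df.set_index(['Timestamp', 'Item'])
         .unstack()
         .stack(dropna=False)
         .fillna(0.)
         .reset_index())
```

With `dropna=False` the NaN never needs a sentinel; `fillna(0.)` is only there if you want zeros rather than NaN in the restored rows.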
marc_s
Max