0

It's the first time that I am using a dataset in .h5 format for which I am assigning keys to a new variable using the code below

# Assign every meter to a variable  
house3_aggregate = ukdale.get('/building3/elec/meter1')
house3_kettle = ukdale.get('/building3/elec/meter2')
house3_electric_heater = ukdale.get('/building3/elec/meter3')
house3_laptop = ukdale.get('/building3/elec/meter4')
house3_projector = ukdale.get('/building3/elec/meter5')

Printing the first variable gives me the following structure:

print(house3_aggregate)

physical_quantity           power
type                        apparent
2013-02-27 20:35:14+00:00   5.0
2013-02-27 20:35:20+00:00   4.0
2013-02-27 20:35:26+00:00   5.0
2013-02-27 20:35:32+00:00   5.0
2013-02-27 20:35:38+00:00   4.0
... ...
2013-04-08 06:14:22+01:00   168.0
2013-04-08 06:14:28+01:00   171.0
2013-04-08 06:14:34+01:00   176.0
2013-04-08 06:14:40+01:00   174.0
2013-04-08 06:14:53+01:00   3122.0
512327 rows × 1 columns

After using the .columns to get a better understanding of what I am seeing here I get the following output:

MultiIndex([('power', 'apparent')],
           names=['physical_quantity', 'type'])

I want to transform this dataset into a dataset that only has two columns like so

Timestamp                   Energy Consumption
2013-02-27 20:35:14+00:00   5.0
2013-02-27 20:35:20+00:00   4.0
2013-02-27 20:35:26+00:00   5.0
2013-02-27 20:35:32+00:00   5.0
2013-02-27 20:35:38+00:00   4.0
... ...
2013-04-08 06:14:22+01:00   168.0
2013-04-08 06:14:28+01:00   171.0
2013-04-08 06:14:34+01:00   176.0
2013-04-08 06:14:40+01:00   174.0
2013-04-08 06:14:53+01:00   3122.0
512327 rows × 1 columns

I found this post and I tried to implement the code like so

house3_aggregate = house3_aggregate.reset_index()
print(house3_aggregate)

This gives me the following output

physical_quantity                     index    power
type                                        apparent
0                 2013-02-27 20:35:14+00:00      5.0
1                 2013-02-27 20:35:20+00:00      4.0
2                 2013-02-27 20:35:26+00:00      5.0
3                 2013-02-27 20:35:32+00:00      5.0
4                 2013-02-27 20:35:38+00:00      4.0
...                                     ...      ...
512322            2013-04-08 06:14:22+01:00    168.0
512323            2013-04-08 06:14:28+01:00    171.0
512324            2013-04-08 06:14:34+01:00    176.0
512325            2013-04-08 06:14:40+01:00    174.0
512326            2013-04-08 06:14:53+01:00   3122.0

[512327 rows x 2 columns]

Besides the wrong column naming this seems like what I need. Expect that wen I call .column again the data still seems to be a MultiIndex

house3_aggregate.columns
MultiIndex([('index',         ''),
            ('power', 'apparent')],
           names=['physical_quantity', 'type'])

I also tried this approach and used the code below

house3_aggregate = house3_aggregate.to_frame()
print(house3_aggregate)

This gives me the following error

AttributeError: 'DataFrame' object has no attribute 'to_frame'

I am confused now as this error makes it seem like the data is already two-dimensional. Any idea on what I should do to get rid of the Multindex and transform the data in a two-dimensional data frame like below? Ultimately I want to use pd.concat to merge all the variables into a single dataset for house3

Timestamp                   Energy Consumption
2013-02-27 20:35:14+00:00   5.0
2013-02-27 20:35:20+00:00   4.0
2013-02-27 20:35:26+00:00   5.0
2013-02-27 20:35:32+00:00   5.0
2013-02-27 20:35:38+00:00   4.0
... ...
2013-04-08 06:14:22+01:00   168.0
2013-04-08 06:14:28+01:00   171.0
2013-04-08 06:14:34+01:00   176.0
2013-04-08 06:14:40+01:00   174.0
2013-04-08 06:14:53+01:00   3122.0
512327 rows × 1 columns
MxGr20
  • 77
  • 6

1 Answers1

1

A DataFrame consists of three parts: The index, the columns and the values. You can see that in your DataFrame:

Out[31]: 
physical_quantity            power
type                      apparent
2013-02-27 20:35:14+00:00         
2013-02-27 20:35:20+00:00      4.0
2013-02-27 20:35:26+00:00      5.0
2013-02-27 20:35:32+00:00      5.0
2013-02-27 20:35:38+00:00      4.0

In [38]: df.index
Out[38]: 
Index(['2013-02-27 20:35:20+00:00', '2013-02-27 20:35:26+00:00',
       '2013-02-27 20:35:32+00:00', '2013-02-27 20:35:38+00:00'],
      dtype='object', name='index')

In [34]: df.columns
Out[34]: 
MultiIndex([('power', 'apparent')],
           names=['physical_quantity', 'type'])


In [32]: df.values
Out[32]: 
array([[4.],
       [5.],
       [5.],
       [4.]])

You can set each of these however you like as long as the sizes stay consistent to each other. The index must be as large as the number of rows and the columns must correspond to the number of columns.

A Multiindex simply contains a tuple for each column value. This might be very useful if you want to be able to select subsets of your DataFrame easily.

You want to overwrite the column names. First in order to "see" the index name (which I set to 'index' in my example) you can use reset_index

In [39]: df = df.reset_index()
Out[39]: 
physical_quantity                     index    power
type                                        apparent
0                  2013-02-27 20:35:20+0...      4.0
1                  2013-02-27 20:35:26+0...      5.0
2                  2013-02-27 20:35:32+0...      5.0
3                  2013-02-27 20:35:38+0...      4.0

In [41]: df.columns
Out[41]: 
MultiIndex([('index',         ''),
            ('power', 'apparent')],
           names=['physical_quantity', 'type'])

As you see the length of the columns is now 2. Now you can overwrite it with

In [42]: df.columns = ['Timestamp', 'Energy Consumption']

In [43]: df
Out[43]: 
                  Timestamp  Energy Consumption
0  2013-02-27 20:35:20+0...                 4.0
1  2013-02-27 20:35:26+0...                 5.0
2  2013-02-27 20:35:32+0...                 5.0
3  2013-02-27 20:35:38+0...                 4.0
maow
  • 2,712
  • 1
  • 11
  • 25