1

I have a dataframe which contains various datapoints over MULTIPLE years from ONE unit. This unit number is listed in the first column of my dataframe, named 'Unit'. The year variable is in the second column.

For visalisation, this is a mini version of my dataset. In reality is the size: 55 columns by 700000 rows.

import random
col3=[random.randrange(1,101,1) for _ in range (14)]
col4=[random.randrange(1,101,1) for _ in range (14)]

d = {'Unit': [1, 1, 1, 1, 2, 2, 2, 3, 4, 5, 6, 6, 6, 6], 
     'Year': [2014, 2015, 2016, 2017, 2015, 2016, 2017, 2017, 2014, 2015, 2014, 2015, 2016, 2017], 'col3' : col3, 'col4' : col4 }
df = pd.DataFrame(data=d)

With this dataset I want to look at the ratios between col3 and col4 within a year and between years. For this reason I want to make a three dimensional dataframe, which places year on an additional axis and not as a variable in my 2D frame.

Does anyone have tips on how to do this? and is this a good approach? suggestions?

Jen

jenny
  • 143
  • 7
  • Do you need [pivot](https://stackoverflow.com/questions/47152691/how-to-pivot-a-dataframe) ? – jezrael Jun 05 '19 at 12:13
  • Perhaps using the [`MultiIndex`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.html)? Or create separate frame for each year and do comparisons that way? – David Zemens Jun 05 '19 at 12:15
  • 1
    There was something called a [Panel](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Panel.html) in pandas, which might be what you want? But it's deprecated now (look into the link for alternatives). – flurble Jun 05 '19 at 12:27

1 Answers1

3

Because Panel is deprecated now, here is alternative with MultiIndex:

df1 = df.set_index(['Unit','Year']).unstack()
print (df1)
      col3                    col4                  
Year  2014  2015  2016  2017  2014  2015  2016  2017
Unit                                                
1     58.0  80.0  37.0  79.0  12.0  32.0  22.0  39.0
2      NaN  90.0  46.0  98.0   NaN  73.0  95.0  28.0
3      NaN   NaN   NaN  52.0   NaN   NaN   NaN  90.0
4     90.0   NaN   NaN   NaN  92.0   NaN   NaN   NaN
5      NaN   9.0   NaN   NaN   NaN   3.0   NaN   NaN
6     33.0  22.0  41.0  40.0  21.0   5.0  69.0  10.0

df2 = df.set_index(['Year','Unit']).unstack()
print (df2)
      col3                               col4                             
Unit     1     2     3     4    5     6     1     2     3     4    5     6
Year                                                                      
2014  58.0   NaN   NaN  90.0  NaN  33.0  12.0   NaN   NaN  92.0  NaN  21.0
2015  80.0  90.0   NaN   NaN  9.0  22.0  32.0  73.0   NaN   NaN  3.0   5.0
2016  37.0  46.0   NaN   NaN  NaN  41.0  22.0  95.0   NaN   NaN  NaN  69.0
2017  79.0  98.0  52.0   NaN  NaN  40.0  39.0  28.0  90.0   NaN  NaN  10.0

Another way is use xarray.

jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252