0

I have a Pandas data frame with three columns, two of which have identifiers (date and id) and one with the values I actually care about (value). It looks like this:

,date,id,value
0,20210801,269277473,-1114389.6
1,20210802,269277473,-1658061.0
2,20210803,269277473,-1338010.2
3,20210804,269277473,-475779.6
4,20210805,269277473,-1417980.0
5,20210806,269277473,-1673400.6
6,20210807,269277473,-1438969.8
12,20210801,269277476,504300.0
13,20210802,269277476,519889.8
14,20210803,269277476,513899.4
15,20210804,269277476,526258.8
16,20210805,269277476,524730.0
17,20210806,269277476,548010.6
18,20210807,269277476,539031.0
24,20210801,269277480,477399.0
25,20210802,269277480,443499.0
26,20210803,269277480,394801.2
27,20210804,269277480,440100.0
28,20210805,269277480,455499.6
29,20210806,269277480,441100.2
30,20210807,269277480,438899.4

I want to roll the values into a table in which the date in the index, the columns are the ids, and the content is the values, like the following:

date,269277473,269277476,269277480
20210801,-1114389.6,504300.0,477399.0
20210802,-1658061.0,519889.8,443499.0
20210803,-1338010.2,513899.4,394801.2
20210804,-475779.6,526258.8,440100.0
20210805,-1417980.0,524730.0,455499.6
20210806,-1673400.6,548010.6,441100.2
20210807,-1438969.8,539031.0,438899.4

Given my table us huge (hundreds of millions of values), what is the most efficient way of accomplishing this?

2 Answers2

3

You need to apply a pivot:

df.pivot(*df)

which is equivalent in your case (as the columns are in order) to:

df.pivot(index='date', columns='id', values='value')

output:

id        269277473  269277476  269277480
date                                     
20210801 -1114389.6   504300.0   477399.0
20210802 -1658061.0   519889.8   443499.0
20210803 -1338010.2   513899.4   394801.2
20210804  -475779.6   526258.8   440100.0
20210805 -1417980.0   524730.0   455499.6
20210806 -1673400.6   548010.6   441100.2
20210807 -1438969.8   539031.0   438899.4
mozway
  • 194,879
  • 13
  • 39
  • 75
2

Use pivot:

>>> df.pivot(index='date', columns='id', values='value')
id        269277473  269277476  269277480
date
20210801 -1114389.6   504300.0   477399.0
20210802 -1658061.0   519889.8   443499.0
20210803 -1338010.2   513899.4   394801.2
20210804  -475779.6   526258.8   440100.0
20210805 -1417980.0   524730.0   455499.6
20210806 -1673400.6   548010.6   441100.2
20210807 -1438969.8   539031.0   438899.4

If you have other columns, you can use pivot_table instead to apply a function on values of each columns (mean, sum, ...)

Corralien
  • 109,409
  • 8
  • 28
  • 52