My data frame is as follows:

selection_id  last_traded_price
430494        1.46
430494        1.48
430494        1.56
430494        1.57
430495        2.45
430495        2.67
430495        2.72
430495        2.87

I have lots of rows that contain selection IDs, and I need to keep the selection_id column the same but transpose the data in last_traded_price so that it looks like this:

selection_id  last_traded_price
430494        1.46              1.48          1.56      1.57    etc.
430495        2.45              2.67          2.72      2.87    etc.

I've tried to use a pivot:

df.pivot(index='selection_id', columns='last_traded_price', values='last_traded_price')

The pivot isn't working due to duplicate rows in selection_id. Is it possible to transpose the data first and drop the duplicates after?
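For reference, here is a minimal sketch that rebuilds the sample frame above (assuming pandas is imported as pd):

import pandas as pd

df = pd.DataFrame({
    'selection_id': [430494, 430494, 430494, 430494,
                     430495, 430495, 430495, 430495],
    'last_traded_price': [1.46, 1.48, 1.56, 1.57,
                          2.45, 2.67, 2.72, 2.87],
})

# Pivoting on the price column itself only works while every
# (selection_id, last_traded_price) pair is unique; as soon as a
# selection repeats a price, pivot raises a ValueError about
# duplicate index entries.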

tomoc4

2 Answers


Option 1
groupby + apply

v = df.groupby('selection_id').last_traded_price.apply(list)
pd.DataFrame(v.tolist(), index=v.index)

                 0     1     2     3
selection_id                        
430494        1.46  1.48  1.56  1.57
430495        2.45  2.67  2.72  2.87
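
If you want selection_id back as an ordinary column instead of the index (the same reset_index step that comes up in the comments below), a small follow-up sketch:

out = pd.DataFrame(v.tolist(), index=v.index).reset_index()
print(out)

   selection_id     0     1     2     3
0        430494  1.46  1.48  1.56  1.57
1        430495  2.45  2.67  2.72  2.87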

Option 2
You can do this with pivot, as long as you also have a column of counts to pivot along (pivot needs something to use as the new column labels, that's why).

df['Count'] = df.groupby('selection_id').cumcount()
df.pivot(index='selection_id', columns='Count', values='last_traded_price')

Count            0     1     2     3
selection_id                        
430494        1.46  1.48  1.56  1.57
430495        2.45  2.67  2.72  2.87
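
If you would rather not leave the helper Count column on the original frame, one variation (a sketch, not a requirement) is to compute it inline with assign:

pivoted = (df.assign(Count=df.groupby('selection_id').cumcount())
             .pivot(index='selection_id', columns='Count',
                    values='last_traded_price'))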
cs95
  • Now it is perfectly OK, you can just change `pd.Series.tolist` to `list` for a better answer ;) – jezrael Jan 19 '18 at 10:22
  • @jezrael Thank you so much, that's exactly what I could benefit from... – cs95 Jan 19 '18 at 10:23
  • @jezrael thanks dude, your answer solved my issue. I would rate it best answer but I don't have enough reputation to do that. I will when I do! – tomoc4 Jan 19 '18 at 20:02
  • @tomoc4 you can still accept his answer. You can also upvote both answers, because I think they're good. – cs95 Jan 19 '18 at 20:37
  • @coldspeed I tried running both solutions and for some reason my output isn't what I expected. Maybe I wasn't specific enough in the original question. For some reason selection_id is printing along the columns as well. I need to transpose last_traded_price first and then, after that's done, drop the rows that contain duplicate selection_id's. I'd also like to mention that I'm trying this on 4 million rows! – tomoc4 Jan 20 '18 at 22:43
  • @tomoc4 my answer gives the same thing that the accepted answer does. Are you saying my answer isn't working? Specifically, what part is wrong? If you can explain in more detail, I'll see if I can give you an alternative. – cs95 Jan 20 '18 at 22:51
  • @coldspeed both answers actually print the same row and column sizes, so they're identical. Each row that contains a selection_id has a corresponding price movement (last_traded_price). What I'm trying to capture is the range of price movements related to the selection_id. I've just pasted a sample of the column data here if you want to try: https://pastebin.com/8K6ktrdK I'm not sure why the selection_id's eventually print along the columns; it might be due to using selection_id as an index – tomoc4 Jan 20 '18 at 23:01
  • Would it be possible to create a list of all last_traded_price rows relating to the selection_id instead? – tomoc4 Jan 20 '18 at 23:03
  • @tomoc4 if I understood, it seems you can just call result.reset_index() to remove it from the index. – cs95 Jan 20 '18 at 23:05
  • @coldspeed I just tried this on the sample paste I made and it works fine. It must be an index issue. How do I call result.reset_index() ? – tomoc4 Jan 20 '18 at 23:17
  • @tomoc4 If you're using my code, change the second line in option 1 to `pd.DataFrame(v.tolist(), index=v.index).reset_index()` – cs95 Jan 20 '18 at 23:17
  • @coldspeed that's not working. It's corrupting the data for some reason I get this format 1565,"[3.0499999999999998, 3.0499999999999998, 3.0499999999999998, 3.0499999999999998, 3.0499999999999998, 3.0499999999999998, 3.5499999999999998, 3.5, 3.6000000000000001, 3.6000000000000001, 3.5, 3.5, 3.5, 3.5499999999999998, 3.5499999999999998, 3.5499999999999998, 3.5499999999999998, 3.5499999999999998, 3 – tomoc4 Jan 20 '18 at 23:33
  • @tomoc4 That isn't corrupt data, that's just floating point inaccuracies. See here: https://stackoverflow.com/questions/588004/is-floating-point-math-broken – cs95 Jan 20 '18 at 23:34
  • @coldspeed I just figured that out. Thanks man you helped me a lot and I learnt new things about pandas. because of you. Thanks again! – tomoc4 Jan 20 '18 at 23:38
  • @tomoc4 No worries, anytime. – cs95 Jan 20 '18 at 23:40

You can use cumcount to create a counter for the new column names, combined with set_index + unstack or pandas.pivot:

g = df.groupby('selection_id').cumcount()
df = df.set_index(['selection_id',g])['last_traded_price'].unstack()
print (df)
                 0     1     2     3
selection_id                        
430494        1.46  1.48  1.56  1.57
430495        2.45  2.67  2.72  2.87
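
If the groups have unequal lengths, unstack simply leaves NaN in the missing slots; a quick sketch on a tiny illustrative frame (not the data above):

raw = pd.DataFrame({'selection_id': [430494, 430494, 430495],
                    'last_traded_price': [1.46, 1.48, 2.45]})
g = raw.groupby('selection_id').cumcount()
print (raw.set_index(['selection_id', g])['last_traded_price'].unstack())

                 0     1
selection_id            
430494        1.46  1.48
430495        2.45   NaN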

A similar solution with pivot (starting again from the original df):

df = pd.pivot(index=df['selection_id'], 
              columns=df.groupby('selection_id').cumcount(), 
              values=df['last_traded_price'])
print (df)
                 0     1     2     3
selection_id                        
430494        1.46  1.48  1.56  1.57
430495        2.45  2.67  2.72  2.87
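
One thing worth checking on a large frame (the comments above mention 4 million rows): the width of the reshaped result is the size of the largest group, so a quick look before reshaping is cheap (this sketch assumes df is still the original long frame):

n_cols = df.groupby('selection_id').size().max()
print (n_cols)

4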
jezrael