2

I have a pandas dataframe that looks like this:

index p1 a1 phase file_number e1
388 19.288 21.630 0.0 0 0.0
389 40.910 71.489 1.0 0 0.0
390 31.310 43.952 2.0 0 0.0
391 28.420 30.250 3.0 0 0.0
392 17.940 22.000 0.0 1 0.0
393 38.020 68.750 1.0 1 0.0
394 31.230 48.352 2.0 1 1.0
395 26.902 29.880 3.0 1 0.0

We can create it using this code

d = {'p1': {388: 19.288,389: 40.91,390: 31.31,391: 28.42,392: 17.94,393: 38.02,394: 31.23,395: 26.902},
     'a1': {388: 21.63,389: 71.489,390: 43.952,391: 30.25,392: 22.0,393: 68.75,394: 48.352,395: 29.88},
     'phase': {388: 0.0,389: 1.0,390: 2.0,391: 3.0,392: 0.0,393: 1.0,394: 2.0,395: 3.0},
     'file_number': {388: 0, 389: 0, 390: 0, 391: 0, 392: 1, 393: 1, 394: 1, 395: 1},
     'e1': {388: 0.0,389: 0.0,390: 0.0,391: 0.0,392: 0.0,393: 1.0,394: 0.0,395: 0.0}}

df = pd.DataFrame(d)

For I want to transform this dataframe, so I have 1 row for every file_number. And transform it with respect to phase - basically collapse many rows into one for each file_number. Phase number will always be 0, 1, 2, 3. Final table should look like this:

p1_0 p1_1 p1_2 p1_3 a1_0 p1_1 a1_2 a1_3 e1_0 e1_1 e1_2 e1_3
19.288 40.910 31.310 28.420 21.630 71.489 43.952 30.250 0 0 0 0
17.940 38.020 31.230 26.902 22.000 68.750 48.352 29.880 0 0 1 0

Where suffix means p1_phase, a1_phase and so on.

I want to make it as fast as possible. Since my data is very large, I'd rather avoid looping.

  • there are numerous answers on Stack Overflow for this. It's called pivoting in pandas. Look at `stack`, `pivot` or `pivot_table` – sammywemmy Jul 14 '21 at 11:40
  • Read Q/A 10 in [this question](https://stackoverflow.com/questions/47152691/how-to-pivot-a-dataframe). – Quang Hoang Jul 14 '21 at 11:47

1 Answers1

1
d = {'p1': {388: 19.288,389: 40.91,390: 31.31,391: 28.42,392: 17.94,393: 38.02,394: 31.23,395: 26.902},
     'a1': {388: 21.63,389: 71.489,390: 43.952,391: 30.25,392: 22.0,393: 68.75,394: 48.352,395: 29.88},
     'phase': {388: 0.0,389: 1.0,390: 2.0,391: 3.0,392: 0.0,393: 1.0,394: 2.0,395: 3.0},
     'file_number': {388: 0, 389: 0, 390: 0, 391: 0, 392: 1, 393: 1, 394: 1, 395: 1},
     'e1': {388: 0.0,389: 0.0,390: 0.0,391: 0.0,392: 0.0,393: 1.0,394: 0.0,395: 0.0}}

df = pd.DataFrame(d)
# pivot the data
pivoted = df.pivot(index='file_number', columns='phase')
# flatten the columns
pivoted.columns = [f'{col[0]}_{int(col[1])}'  for col in pivoted.columns.values]

After this pivoted is a Dataframe with the shape you desired.

Basically a combination of these two questions:

Chris Wesseling
  • 6,226
  • 2
  • 36
  • 72
  • Thanks, this one worked! The way that I found was res = pd.pivot_table(df, index=['file_number', 'phase']).unstack().reset_index(drop=True) res.columns = [f'{col[0]}_{int(col[1])}' for col in res.columns] but your solution is cleaner – Evangelista01 Jul 14 '21 at 15:01
  • :) That's about the same. If file_number is unimportant, it makes sense to re-index and drop it. – Chris Wesseling Jul 14 '21 at 15:26