0

I have a problem. I have a table like this with roughly 500k rows and 190 unique dates:

Date Return
1 0.1
2 0.2
1 0.3
2 0.5
1 0.4
2 0.3

And I have to format it so that its grouped by date as an index and all returns are on the one row like this:

Date Return Return 2 Return 3
1 0.1 0.3 0.4
2 0.2 0.5 0.3

or the same but dates as columns

and preferably as numpy array. How to do that? ATM my Jupyter just crashes as my solution is too heavy.

Riku T
  • 11
  • 2
  • 1
    It's Q & A # 10 in [the canonical](https://stackoverflow.com/q/47152691/15497888) `new_df = df.pivot_table(index='Date', columns=df.groupby('Date').cumcount(), values='Return')` – Henry Ecker Feb 01 '22 at 18:45
  • Some additional formatting options in [my answer here](https://stackoverflow.com/a/68535973/15497888): `new_df = df.pivot_table(index='Date', columns=df.groupby('Date').cumcount() + 1, values='Return').add_prefix('Return ').reset_index()` – Henry Ecker Feb 01 '22 at 18:48
  • Thanks for the help! Though this did not fully solve my problem. It loses thousands of observations. In addition, there is ID column that identifies the returns for a specific firm, and most of them does not have observation throughout the whole time period. This changes the idea as there are bunch of rows which probably needs to fill with Nulls – Riku T Feb 01 '22 at 20:02
  • I'm glad we were able to solve this part of the issue. The options above in the linked threads and the answer below all answer the shown sample input and output. If you have conditions that are not covered in this question, you could ask a new question which includes data that demonstrates the issue. Come up with a sample DataFrame that more accurately resembles your actual problem. Like DataFrame with an ID column and missing observations etc. Then someone can help you solve those aspects as well. – Henry Ecker Feb 01 '22 at 20:26
  • Got it. I managed to figure out rest myself. Though I would have never gotten it through without your help. I just filtered out firms with fewer observations than periods. Like this: `df = df.groupby("firm").filter(lambda x: len(x) == periods)` and now it works. Thanks! – Riku T Feb 01 '22 at 20:56

1 Answers1

0

Use pivot_table:

out = df.assign(col='Return ' + df.groupby('Date').cumcount().add(1).astype(str)) \
        .pivot_table('Return', 'Date', 'col').reset_index().rename_axis(None, axis=1)
print(out)

# Output
   Date  Return 1  Return 2  Return 3
0     1       0.1       0.3       0.4
1     2       0.2       0.5       0.3
Corralien
  • 109,409
  • 8
  • 28
  • 52
  • Thanks for the effort! I used the solution above but yours was also really helpful (and functioning). Once again I learned something new. – Riku T Feb 01 '22 at 21:00