0

I have a dataframe with 2 columns:

  1. session ID
  2. user action

User actions can be of the types 1,2,3,4,5,6. Each session is composed by a sequence of several user actions.

E.g.

Session AAAAA: 1 1 2 3 1 6 1 1 5

Session BBBBB: 1 2 5

...

I have 400k unique session IDs, however I have 5M rows. Because, each row is a session ID + a user action. Then that same session ID appears more times below, matched with another action.

How the data looks:

Session |  Event

AAAAA | 1

AAAAA | 1

BBBBB | 1

CCCCC | 1

BBBBB | 2

AAAAA | 2

.....

5M rows

But, I need to get the data in the format said above. My 1st column should be the session ID, where I would have the 400k unique session IDs in each row. And then, displayed horizontally along each row, I'd have the sequence of events per session. Their column names could be 'event1', 'event2',... Also, they will be of different lenghts, so the NaN could be left as they are, or filled with 0.

How I'd like the dataframe:

Session | e1 | e2 | ...

AAAAA | 1 | 1 | 2 | 3 | 1 | 6 | 1 | 1 | 5 | 0 | 0 | 0

BBBBB | 1 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0

...

I think it's a very easy thing to do in SQL for example, but I'm just not familiar with this in Python.

funie200
  • 3,688
  • 5
  • 21
  • 34
xcentralx
  • 15
  • 6
  • Sounds as if you need to research df.groupby() – Patrick Artner Dec 01 '20 at 10:48
  • I think you need pivoting 2 columns DataFrame, check Q/A 10. But not 100% sure, because not clear how looks input data, because in question are 2 formats - splitted by whitespaces and then not. – jezrael Dec 01 '20 at 10:53
  • If splitted by whitespaces maybe need `df = df.join(df.pop(' Event').str.split(expand=True).add_prefix('e').fillna(0))` – jezrael Dec 01 '20 at 10:54

0 Answers0