0

As quite a newbie to pandas, I'm struggling with a data arrangement issue.

I've got a huge pile of data from a log file in a pandas dataframe with a structure like this:

day   user   measure1   measure2   ...
1     u1     xxxxx      yyyyy      ...
1     u2     xxxxx      yyyyy      ...
1     u3     xxxxx      yyyyy      ...
2     u2     xxxxx      yyyyy      ...
2     u4     xxxxx      yyyyy      ...
2     u3     xxxxx      yyyyy      ...
3     u1     xxxxx      yyyyy      ...
3     u3     xxxxx      yyyyy      ...
...   ...    ...        ...        ...

Hence, not every user appears at each day, while the data is neither sorted by day nor by user. However, if an entry occurs, is has all the measures.

Now I need to rearrange this data to obtain a 2D table "every user" vs. "every day" for each measure and fill the gaps with zeros e.g.

For measure1:                      For measure2:
      u1     u2     u3     u4            u1     u2     u3     u4
1  xxxxx  xxxxx  xxxxx      0      1  yyyyy  yyyyy  yyyyy      0  
2      0  xxxxx  xxxxx  xxxxx      2      0  yyyyy  yyyyy  yyyyy  
3  xxxxx      0  xxxxx      0      3  yyyyy      0  yyyyy      0  

How can I do this with pandas in python3? I'm also open to alternative solutions e.g. using numpy instead of pandas.

So far I managed to extract arrays of all occurring users and days in the dataset but have no clue how to smartly assign the measured data.

I'm grateful for any help on this matter.

Nazim Kerimbekov
  • 4,712
  • 8
  • 34
  • 58
MadMad
  • 3
  • 1

2 Answers2

0

It seems like you want a multi-index dataframe (index1: day, index2: measure)

The tricky part is that you might need to transpose your dataframe before these operations. Have a look at the answer of this issue which looks similar to yours Constructing 3D Pandas DataFrame

Hope it helps

zar3bski
  • 2,773
  • 7
  • 25
  • 58
0

You need set_index and unstack

df.set_index(['day','user']).measure1.unstack(fill_value=0)
Out[6]: 
user     u1     u2     u3     u4
day                             
1     xxxxx  xxxxx  xxxxx      0
2         0  xxxxx  xxxxx  xxxxx
3     xxxxx      0  xxxxx      0
df.set_index(['day','user']).measure2.unstack(fill_value=0)
Out[7]: 
user     u1     u2     u3     u4
day                             
1     yyyyy  yyyyy  yyyyy      0
2         0  yyyyy  yyyyy  yyyyy
3     yyyyy      0  yyyyy      0
BENY
  • 317,841
  • 20
  • 164
  • 234
  • Genius, that's wonderfull. I punched together a workaroud building a new Dataframe with usernames as columns and loop over days to fill it, which turned out to be quite slow. Your one-liner is just perfect. Thanks! – MadMad Feb 14 '18 at 18:43
  • @MadMad yw :-) happy coding – BENY Feb 14 '18 at 18:58