0

I'm working with time series Pandas dataframe where for each company there are several observations indexed by date. For each observation there are multi values. I've been experimenting with pandas MultiIndex vs stack/unstack and have gotten close to desired result but am stuck:

Here's what I've tried -- full code to reproduce:

import pandas.util.testing as tm; tm.N = 3
import numpy as np

def unpivot(frame):
    N, K = frame.shape
    data = {'value' : frame.values.ravel('F'),
            'variable' : np.asarray(frame.columns).repeat(N),
            'date' : np.tile(np.asarray(frame.index), K)}
    return pd.DataFrame(data, columns=['date', 'variable', 'value'])

df = unpivot(tm.makeTimeDataFrame())
df['topic'] = df.groupby('variable').cumcount()+1
df['company'] = np.random.choice(['Apple Inc', 'Google', 'Nike Corp'], len(df))

This is what I start with:

         date variable     value  topic    company
0  2000-01-03        A -1.250926      1  Apple Inc
1  2000-01-04        A -0.807887      2     Google
2  2000-01-05        A  0.302593      3  Nike Corp
3  2000-01-03        B  0.502474      1  Nike Corp
4  2000-01-04        B  0.974968      2  Apple Inc
5  2000-01-05        B  0.698478      3     Google
6  2000-01-03        C  0.871437      1  Nike Corp
7  2000-01-04        C  0.235059      2  Apple Inc
8  2000-01-05        C -1.827321      3     Google
9  2000-01-03        D -0.764457      1     Google
10 2000-01-04        D  0.014109      2  Nike Corp
11 2000-01-05        D -0.966902      3     Google

My goal is something like this:

topic                                 1         2         3
company   variable date                                    
Apple Inc A        2000-01-03      -1.2       NaN       NaN
          B        2000-01-04       NaN       0.9       NaN
          C        2000-01-04       NaN       0.2       NaN
Google    A        2000-01-04       NaN      -0.8       NaN
          B        2000-01-05       NaN       NaN       0.69


And I've tried this but 'date' is not correct:

df.set_index(['company', 'variable', 'date'], inplace=False).unstack()
                        value                            topic             \
date               2000-01-03 2000-01-04 2000-01-05 2000-01-03 2000-01-04   
company   variable                                                          
Apple Inc A         -1.250926        NaN        NaN        1.0        NaN   
          B               NaN   0.974968        NaN        NaN        2.0   
          C               NaN   0.235059        NaN        NaN        2.0   
Google    A               NaN  -0.807887        NaN        NaN        2.0   
          B               NaN        NaN   0.698478        NaN        NaN   
          C               NaN        NaN  -1.827321        NaN        NaN   
          D         -0.764457        NaN  -0.966902        1.0        NaN   
Nike Corp A               NaN        NaN   0.302593        NaN        NaN   
          B          0.502474        NaN        NaN        1.0        NaN   
          C          0.871437        NaN        NaN        1.0        NaN   
          D               NaN   0.014109        NaN        NaN        2.0   


date               2000-01-05  
company   variable             
Apple Inc A               NaN  
          B               NaN  
          C               NaN  
Google    A               NaN  
          B               3.0  
          C               3.0  
          D               3.0  
Nike Corp A               3.0  
          B               NaN  
          C               NaN  
          D               NaN 
colbythenoob
  • 95
  • 2
  • 9
  • 1
    Your 'goal' looks very strange. The 'topic' column there actually contains a 3 column matrix? I believe this is not possible in pandas. Could you write down a piece of code that would manually create a dataframe which contains the data you'd like to have at the end? – godfryd Jan 24 '19 at 21:01
  • Could you repost your starting dataframe, doing something like: `print(df.to_csv())`? It's hard to consume it as it's currently given. – PMende Jan 24 '19 at 21:04
  • Did you just change your whole question ? – BENY Jan 24 '19 at 21:10
  • sure @PMende ,date,variable,value,topic,company 0,2000-01-03,A,0.14520927505913114,1,Apple Inc 1,2000-01-04,A,1.3305625699847172,2,Apple Inc 2,2000-01-05,A,-0.5092440923096634,3,Google 3,2000-01-03,B,-0.5331379822609876,1,Apple Inc 4,2000-01-04,B,-1.8819470784218326,2,Google 5,2000-01-05,B,0.4639895484513931,3,Apple Inc 6,2000-01-03,C,-0.009324650000272683,1,Apple Inc 7,2000-01-04,C,1.43392801995639,2,Nike Corp 8,2000-01-05,C,-2.1387438527839673,3,Google 9,2000-01-03,D,1.1612391157834347,1,Nike Corp 10,2000-01-04,D,-1.7026024568155145,2,Apple Inc – colbythenoob Jan 24 '19 at 21:10
  • Hey @godfryd had a bug in code. there is a value column so you were correct in the matrix size. I just updated per you point – colbythenoob Jan 24 '19 at 21:11
  • I think what it's missing is just unstack the `topic` index. `df.set_index(['company', 'variable', 'date', 'topic']).unstack('topic')`. Does this give the result you need? – Peng Jan 24 '19 at 21:50

1 Answers1

2

Check with get_dummies

s=df.set_index(['company','variable','date']).topic.astype(str).str.get_dummies()
s.mul(s.columns).sort_index(level=0).sum(level=[0,1,2])
Out[335]: 
                              1  2  3
company  variable date               
AppleInc A        2000-01-03  1      
         B        2000-01-05        3
         D        2000-01-03  1      
                  2000-01-05        3
Google   A        2000-01-04     2   
                  2000-01-05        3
         B        2000-01-03  1      
         C        2000-01-03  1      
                  2000-01-04     2   
         D        2000-01-04     2   
NikeCorp B        2000-01-04     2   
         C        2000-01-05        3

Update since the question update

df.pivot_table(index=['company','variable','date'],columns='topic',values='value')
BENY
  • 317,841
  • 20
  • 164
  • 234