Current data frame: I have a pandas DataFrame in which each employee has a list of text codes, each followed immediately by its associated frequency. Every text code starts with T and is 8 characters long.

+----------+-------------------------------------------------------------+
|  emp_id  |   text                                                      |
+----------+-------------------------------------------------------------+
|   E0001  | [T0431516,-8,T0401531,-12,T0517519,12]                      |
|   E0002  | [T0701540,-1,T0431516,-2]                                   |
|   E0003  | [T0517519,-1,T0421531,-7,T0516319,9,T0500371,-6,T0309711,-3]|
|   E0004  | [T0516319,-3]                                               |
|   E0005  | [T0431516,2]                                                |
+----------+-------------------------------------------------------------+

Expected data frame: I want every text code present in the data frame to become its own column. If an employee has a frequency for that code, the cell should hold the frequency; otherwise it should be 0.

+----------+----------+----------+----------+----------+----------+----------+----------+----------+
|  emp_id  | T0431516 | T0401531 | T0517519 | T0701540 | T0421531 | T0516319 | T0500371 | T0309711 |
+----------+----------+----------+----------+----------+----------+----------+----------+----------+
|   E0001  | -8       | -12      | 12       | 0        | 0        | 0        | 0        | 0        |
|   E0002  | -2       | 0        | 0        | -1       | 0        | 0        | 0        | 0        |
|   E0003  | 0        | 0        | -1       | 0        | -7       | 9        | -6       | -3       |
|   E0004  | 0        | 0        | 0        | 0        | 0        | -3       | 0        | 0        |
|   E0005  | 2        | 0        | 0        | 0        | 0        | 0        | 0        | 0        |
+----------+----------+----------+----------+----------+----------+----------+----------+----------+

Sample data:

import pandas as pd

# Note: each 'text' value here is a single string, not a Python list
df = pd.DataFrame({'emp_id' : {0: 'E0001', 1: 'E0002', 2: 'E0003', 3: 'E0004', 4: 'E0005'},
                   'text' :  {0: '[T0431516,-8,T0401531,-12,T0517519,12]', 1: '[T0701540,-1,T0431516,-2]', 2: '[T0517519,-1,T0421531,-7,T0516319,9,T0500371,-6,T0309711,-3]', 3: '[T0516319,-3]', 4: '[T0431516,2]'}
                   })

So far, my attempts have been unsuccessful. Any pointers or help would be much appreciated!

Rudr
  • Do all of the codes begin with `T` in the entire dataset? – MattR Oct 19 '20 at 21:10
  • Those are not valid lists btw – Chris Oct 19 '20 at 21:15
  • The question itself is interesting, but manually parsing that data table into a dataframe is just too painful.... Please take a look at the way to [share dataframes](https://stackoverflow.com/questions/20109391) on this site. – Bill Huang Oct 19 '20 at 21:35
  • @BillHuang my bad, I totally forgot that! pls check now. – Rudr Oct 20 '20 at 15:59

1 Answer

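You can explode the dataframe so that every list element gets its own row, pair each code with the frequency on the row that follows it, and then create a pivot_table: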

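import pandas as pd

df = pd.DataFrame({'emp_id' : ['E0001', 'E0002', 'E0003', 'E0004', 'E0005'],
                   'text' : [['T0431516',-8,'T0401531',-12,'T0517519',12],
                             ['T0701540',-1,'T0431516',-2],
                             ['T0517519',-1,'T0421531',-7,'T0516319',9,'T0500371',-6,'T0309711',-3],
                             ['T0516319',-3], ['T0431516',2]]})
# One row per list element; codes and frequencies alternate
df = df.explode('text')
# Put the value from the next row (a code's frequency) beside each row
df['freq'] = df['text'].shift(-1)
# Keep only the code rows; .copy() avoids a SettingWithCopyWarning below
df = df[df['text'].str[0] == 'T'].copy()
df['freq'] = df['freq'].astype(int)
# Spread the codes into columns and fill missing combinations with 0
df = pd.pivot_table(df, index='emp_id', columns='text', values='freq', aggfunc='sum').fillna(0).astype(int)
df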
Out[1]: 
text    T0309711  T0401531  T0421531  T0431516  T0500371  T0516319  T0517519  \
emp_id                                                                         
E0001          0       -12         0        -8         0         0        12   
E0002          0         0         0        -2         0         0         0   
E0003         -3         0        -7         0        -6         9        -1   
E0004          0         0         0         0         0        -3         0   
E0005          0         0         0         2         0         0         0   

text    T0701540  
emp_id            
E0001          0  
E0002         -1  
E0003          0  
E0004          0  
E0005          0  
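
The `shift(-1)` step works because each code is immediately followed by its frequency. As an alternative sketch (not part of the original answer), the same pairing can be done by slicing each list into codes and frequencies and building one dict per employee. This assumes every list strictly alternates code, frequency; the names `records` and `wide` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'emp_id': ['E0001', 'E0002', 'E0003', 'E0004', 'E0005'],
    'text': [['T0431516', -8, 'T0401531', -12, 'T0517519', 12],
             ['T0701540', -1, 'T0431516', -2],
             ['T0517519', -1, 'T0421531', -7, 'T0516319', 9,
              'T0500371', -6, 'T0309711', -3],
             ['T0516319', -3],
             ['T0431516', 2]],
})

# Even positions hold codes, odd positions hold frequencies
records = [dict(zip(lst[::2], lst[1::2])) for lst in df['text']]

# One row per employee; codes an employee lacks become NaN, then 0
wide = (pd.DataFrame(records, index=pd.Index(df['emp_id']))
        .fillna(0)
        .astype(int))
print(wide)
```

One difference from the pivot_table version: if a code were repeated within one employee's list, the dict would keep only the last frequency, whereas `aggfunc='sum'` adds them up.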
David Erickson
  • Not completely sure why the 'explode' function is not working on the Sample data I uploaded a while ago. Any thoughts? – Rudr Oct 20 '20 at 16:03
  • @rudr, perhaps you need to convert it to a list? – David Erickson Oct 20 '20 at 17:19
  • No error, seems like it just doesn't do anything even after converting to list. – Rudr Oct 20 '20 at 18:23
  • @Rudr are you setting the results for all lines of code? e.g. `df = df.explode('text')` instead of `df.explode('text')`. Also check this line of code: `df = df[df['text'].str[0] == 'T']`. In your actual data, do the strings start with `T`? – David Erickson Oct 20 '20 at 18:44
  • @Rudr did you ever figure this out? – David Erickson Oct 25 '20 at 22:23
  • Yes, underlying data had issues. It worked perfectly after fixing them and converting to list. Thanks again! – Rudr Oct 28 '20 at 19:17
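
The comments above point to the `text` column holding strings rather than lists. `explode` returns scalar values, including strings, unchanged, which would explain why it appeared to do nothing on the question's sample data. Below is a minimal sketch of the conversion, assuming every string has the form `[code,freq,code,freq,...]` as in that sample; `parse_pairs` is a hypothetical helper, not a pandas function.

```python
import pandas as pd

# Sample data from the question: each 'text' value is one string
df = pd.DataFrame({
    'emp_id': ['E0001', 'E0002', 'E0003', 'E0004', 'E0005'],
    'text': ['[T0431516,-8,T0401531,-12,T0517519,12]',
             '[T0701540,-1,T0431516,-2]',
             '[T0517519,-1,T0421531,-7,T0516319,9,T0500371,-6,T0309711,-3]',
             '[T0516319,-3]',
             '[T0431516,2]'],
})


def parse_pairs(raw):
    """Hypothetical helper: '[T1,-8,T2,3]' -> ['T1', -8, 'T2', 3]."""
    tokens = raw.strip('[]').split(',')
    # Codes sit at even positions, frequencies at odd positions
    return [tok.strip() if i % 2 == 0 else int(tok)
            for i, tok in enumerate(tokens)]


# Turn every string into a real Python list so explode can split it
df['text'] = df['text'].apply(parse_pairs)

# Same steps as in the answer
long_df = df.explode('text')
long_df['freq'] = long_df['text'].shift(-1)
long_df = long_df[long_df['text'].str[0] == 'T'].copy()
long_df['freq'] = long_df['freq'].astype(int)
wide = (pd.pivot_table(long_df, index='emp_id', columns='text',
                       values='freq', aggfunc='sum')
        .fillna(0)
        .astype(int))
print(wide)
```

If the real data contains malformed entries (stray spaces, missing frequencies, empty brackets), `int(tok)` will raise, which is a convenient way to locate the rows that need fixing first.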