0

I have a pandas dataframe that looks like this:

User | Query|                                 Filters                 
----------------------------------------------------------------------------------------- 
1    |  abc | [{u'Op': u'and', u'Type': u'date', u'Val': u'1992'},{u'Op': u'and', u'Type': u'sex', u'Val': u'F'}]
1    |  efg | [{u'Op': u'and', u'Type': u'date', u'Val': u'2000'},{u'Op': u'and', u'Type': u'col', u'Val': u'Blue'}] 
1    |  fgs | [{u'Op': u'and', u'Type': u'date', u'Val': u'2001'},{u'Op': u'and', u'Type': u'col', u'Val': u'Red'}]        
2    |  hij | [{u'Op': u'and', u'Type': u'date', u'Val': u'2002'}]  
2    |  dcv | [{u'Op': u'and', u'Type': u'date', u'Val': u'2001'},{u'Op': u'and', u'Type': u'sex', u'Val': u'F'}]     
2    |  tyu | [{u'Op': u'and', u'Type': u'date', u'Val': u'1999'},{u'Op': u'and', u'Type': u'col', u'Val': u'Yellow'}]     
3    |  jhg | [{u'Op': u'and', u'Type': u'date', u'Val': u'2001'},{u'Op': u'and', u'Type': u'sex', u'Val': u'M'}]    
4    |  mlh | [{u'Op': u'and', u'Type': u'date', u'Val': u'2001'}]  

The result that I expect:

User| Query |  date | sex | col
-------------------------------- 
1   | abc   | 1992  |  F  |
1   | efg   | 2000  |     | Blue
1   | fgs   | 2001  |     | Red
2   | hij   | 2002  |     |
2   | dcv   | 2001  |  F  |
2   | tyu   | 1999  |     | Yellow
3   | jhg   | 2001  |     |
4   | mlh   | 2001  |  H  |

I'm using pandas 0.21.0 with python 2.7.

Example data:

df = pd.DataFrame([{'user': 1,'query': 'abc', 'Filters': [{u'Op': u'and', u'Type': u'date', u'Val': u'1992'},{u'Op': u'and', u'Type': u'sex', u'Val': u'F'}]},
              {'user': 1,'query': 'efg', 'Filters': [{u'Op': u'and', u'Type': u'date', u'Val': u'2000'},{u'Op': u'and', u'Type': u'col', u'Val': u'Blue'}]},
              {'user': 1,'query': 'fgs', 'Filters': [{u'Op': u'and', u'Type': u'date', u'Val': u'2001'},{u'Op': u'and', u'Type': u'col', u'Val': u'Red'}]},
              {'user': 2 ,'query': 'hij', 'Filters': [{u'Op': u'and', u'Type': u'date', u'Val': u'2002'}]},
              {'user': 2 ,'query': 'dcv', 'Filters': [{u'Op': u'and', u'Type': u'date', u'Val': u'2001'},{u'Op': u'and', u'Type': u'sex', u'Val': u'F'}]},
              {'user': 2 ,'query': 'tyu', 'Filters':[{u'Op': u'and', u'Type': u'date', u'Val': u'1999'},{u'Op': u'and', u'Type': u'col', u'Val': u'Yellow'}]},
              {'user': 3 ,'query': 'jhg', 'Filters':[{u'Op': u'and', u'Type': u'date', u'Val': u'2001'},{u'Op': u'and', u'Type': u'sex', u'Val': u'M'}]},
              {'user': 4 ,'query': 'mlh', 'Filters':[{u'Op': u'and', u'Type': u'date', u'Val': u'2001'}]},
             ])

I tried many solutions:

Any suggestions would be much appreciated!

Gino Mempin
  • 25,369
  • 29
  • 96
  • 135
Omar14
  • 2,007
  • 4
  • 21
  • 34

2 Answers2

5

Assuming you have already imported your data, as defined in your MCWE:

data = [{'user': 1,'query': 'abc', 'Filters': [{u'Op': u'and', u'Type': u'date', u'Val': u'1992'},{u'Op': u'and', u'Type': u'sex', u'Val': u'F'}]},
              {'user': 1,'query': 'efg', 'Filters': [{u'Op': u'and', u'Type': u'date', u'Val': u'2000'},{u'Op': u'and', u'Type': u'col', u'Val': u'Blue'}]},
              {'user': 1,'query': 'fgs', 'Filters': [{u'Op': u'and', u'Type': u'date', u'Val': u'2001'},{u'Op': u'and', u'Type': u'col', u'Val': u'Red'}]},
              {'user': 2 ,'query': 'hij', 'Filters': [{u'Op': u'and', u'Type': u'date', u'Val': u'2002'}]},
              {'user': 2 ,'query': 'dcv', 'Filters': [{u'Op': u'and', u'Type': u'date', u'Val': u'2001'},{u'Op': u'and', u'Type': u'sex', u'Val': u'F'}]},
              {'user': 2 ,'query': 'tyu', 'Filters':[{u'Op': u'and', u'Type': u'date', u'Val': u'1999'},{u'Op': u'and', u'Type': u'col', u'Val': u'Yellow'}]},
              {'user': 3 ,'query': 'jhg', 'Filters':[{u'Op': u'and', u'Type': u'date', u'Val': u'2001'},{u'Op': u'and', u'Type': u'sex', u'Val': u'M'}]},
              {'user': 4 ,'query': 'mlh', 'Filters':[{u'Op': u'and', u'Type': u'date', u'Val': u'2001'}]},
             ]

Then, you are looking for Pandas json_normalize method for data normalization:

from pandas.io.json import json_normalize
df = json_normalize(data, 'Filters', ['query', 'user'])

It returns a normalized DataFrame version where your column of json is expanded into eponymous typed columns:

     Op  Type     Val  user query
0   and  date    1992     1   abc
1   and   sex       F     1   abc
2   and  date    2000     1   efg
3   and   col    Blue     1   efg
4   and  date    2001     1   fgs
5   and   col     Red     1   fgs
6   and  date    2002     2   hij
7   and  date    2001     2   dcv
8   and   sex       F     2   dcv
9   and  date    1999     2   tyu
10  and   col  Yellow     2   tyu
11  and  date    2001     3   jhg
12  and   sex       M     3   jhg
13  and  date    2001     4   mlh

Now, you would pivot your DataFrame to convert Type modalities into columns:

df = df.pivot_table(index=['user', 'query', 'Op'], columns='Type', aggfunc='first')

It leads to:

                   Val            
Type               col  date   sex
user query Op                     
1    abc   and    None  1992     F
     efg   and    Blue  2000  None
     fgs   and     Red  2001  None
2    dcv   and    None  2001     F
     hij   and    None  2002  None
     tyu   and  Yellow  1999  None
3    jhg   and    None  2001     M
4    mlh   and    None  2001  None

Finally, you can clean and reset index, if they bother you:

df.columns = df.columns.droplevel(0)
df.reset_index(inplace=True)

Which returns your requested MCVE output:

Type  user query   Op     col  date   sex
0        1   abc  and    None  1992     F
1        1   efg  and    Blue  2000  None
2        1   fgs  and     Red  2001  None
3        2   dcv  and    None  2001     F
4        2   hij  and    None  2002  None
5        2   tyu  and  Yellow  1999  None
6        3   jhg  and    None  2001     M
7        4   mlh  and    None  2001  None

Not column

In this final DataFrame the first column seems to be called Type, but it is not. It is instead a Integer Index without Name:

df.index
RangeIndex(start=0, stop=8, step=1)

And Columns index is called Type which does not hold any modality called Type (therefore no column with this name).

df.columns
Index(['user', 'query', 'Op', 'col', 'date', 'sex'], dtype='object', name='Type')

This is why you cannot remove the column Type (column used in pivot_table), because it does not exist.

If you want to remove this fake column, you need to create a new index for rows:

df.set_index(['user', 'query'], inplace=True)

If Column index Name bothers you, you can reset it:

df.columns.name = None

It leads to:

             Op     col  date   sex
user query                         
1    abc    and    None  1992     F
     efg    and    Blue  2000  None
     fgs    and     Red  2001  None
2    dcv    and    None  2001     F
     hij    and    None  2002  None
     tyu    and  Yellow  1999  None
3    jhg    and    None  2001     M
4    mlh    and    None  2001  None

It is a good practice when you create a new index to always check it is unique:

df.index.is_unique
True

Data from file

If your data are in a file, you should first import it into a variable using PSL json module:

import json
with open(path) as file:
    data = json.load(file)

This will do the trick, then come back to the beginning of my answer.

jlandercy
  • 7,183
  • 1
  • 39
  • 57
  • @landercy that DataFrame was the result of reading a json file: df = pd.read_json(path,lines='True') – Omar14 Nov 16 '17 at 14:09
  • I'am getting this error: TypeError: string indices must be integers – Omar14 Nov 16 '17 at 14:22
  • @Omar14, I have built my answer with your MCWE data, if there is a problem it should happen at the import stage. You can use PSL `json.loads` to feed `json_normalize`. – jlandercy Nov 16 '17 at 15:18
  • @Omar14 added code to complete you request. Please mark as answer, if it suits you. – jlandercy Nov 16 '17 at 15:37
  • @jlandercy Do you know how to drop Type column. I tried df.drop(columns=['Type']) – Omar14 Nov 16 '17 at 16:00
  • @Omar14, you should always refer to [API](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html). `drop` returns a new DataFrame without dropped columns, therefore you must get it: `df = df.drop(labels=['Name'])` or force inplace deletion using the switch: `df.drop(labels=['Name'], inplace=True)`. It appears than Type is not column but instead an Index for rows, there must have one, you can rename it or change it. And type is the name of the column index. Check `df.index` and `df.columns` to confirm this. Hence the confusion – jlandercy Nov 17 '17 at 08:19
1
import pandas as pd

df = pd.DataFrame([{'user': 1,'query': 'abc', 'Filters': [{u'Op': u'and', u'Type': u'date', u'Val': u'1992'},{u'Op': u'and', u'Type': u'sex', u'Val': u'F'}]},
              {'user': 1,'query': 'efg', 'Filters': [{u'Op': u'and', u'Type': u'date', u'Val': u'2000'},{u'Op': u'and', u'Type': u'col', u'Val': u'Blue'}]},
              {'user': 1,'query': 'fgs', 'Filters': [{u'Op': u'and', u'Type': u'date', u'Val': u'2001'},{u'Op': u'and', u'Type': u'col', u'Val': u'Red'}]},
              {'user': 2 ,'query': 'hij', 'Filters': [{u'Op': u'and', u'Type': u'date', u'Val': u'2002'}]},
              {'user': 2 ,'query': 'dcv', 'Filters': [{u'Op': u'and', u'Type': u'date', u'Val': u'2001'},{u'Op': u'and', u'Type': u'sex', u'Val': u'F'}]},
              {'user': 2 ,'query': 'tyu', 'Filters':[{u'Op': u'and', u'Type': u'date', u'Val': u'1999'},{u'Op': u'and', u'Type': u'col', u'Val': u'Yellow'}]},
              {'user': 3 ,'query': 'jhg', 'Filters':[{u'Op': u'and', u'Type': u'date', u'Val': u'2001'},{u'Op': u'and', u'Type': u'sex', u'Val': u'M'}]},
              {'user': 4 ,'query': 'mlh', 'Filters':[{u'Op': u'and', u'Type': u'date', u'Val': u'2001'}]},
             ])

def func(x):
    date = x[0]['Val']
    sex = ''
    col = ''
    if len(x) > 1:
        if x[1]['Val'] in ['F','M']:
            sex = x[1]['Val']
        else:
            col = x[1]['Val']      
    return pd.Series([date,sex,col])

df[['date','sex','color']] = df['Filters'].apply(func)

df

Outputs (not showing filter):

  query  user  date sex   color
0   abc     1  1992   F        
1   efg     1  2000        Blue
2   fgs     1  2001         Red
3   hij     2  2002            
4   dcv     2  2001   F        
5   tyu     2  1999      Yellow
6   jhg     3  2001   M        
7   mlh     4  2001            
Anton vBR
  • 18,287
  • 5
  • 40
  • 46