4

I'm new to pandas. I'm trying to create the following table:

        |          C         |          Perl      |
        |   line   |   func  |    line   |   func |
version |  covered |  total  |  covered  |  total |

There is the following structure of JSON that I have created:

{
   "version1":{
      "perl":{
         "line_covered":207,
         "line_total":312,
         "func_covered":15,
         "func_total":18
      },
      "C":{
         "line_covered":321,
         "line_total":512,
         "func_covered":10,
         "func_total":10
      }
   }
}

I want to iterate over this JSON and create the table. My first problem is that I can't seem to understand how should I create the header of the table. From previous topics, I found that I should use MultiIndex but for some reason, any combination I tried, does not create the wanted header. Is it possible to show me how to create this table?

vesii
  • 2,760
  • 4
  • 25
  • 71

2 Answers2

2

My approach

import numpy as np
from collections import defaultdict
d = defaultdict(list)

for k,v in my_dict.items():
    d['index'].append(k)
    for k1,v1 in v.items():
        for k2,v2 in v1.items():
            d['columns'].append((k1,k2))
            d['data'].append(v2)
d = dict(d)
d['data'] = np.array(d['data']).reshape(1, len(d['data']))
d['columns'] = pd.MultiIndex.from_tuples(columns)

Build DataFrame

pd.DataFrame(**d)

Output

                 perl                                               C                       
         line_covered line_total func_covered func_total func_covered           func_total   
version1          207        312           15         18           10  version1         10   

use defaultdict but could just start three lists before starting the loop

EDIT

for expected output use

import numpy as np
from collections import defaultdict
d = defaultdict(list)
for k,v in my_dict.items():
    d['index'].append(k)
    for k1,v1 in v.items():
        for k2,v2 in v1.items():
            split = k2.split('_')
            d['columns'].append((k1, split[0]))
            d['data'].append(split[1])
d = dict(d)
d['data'] = np.array(d['data']).reshape(1,len(d['data']))
d['columns'] = pd.MultiIndex.from_tuples(d['columns']).copy()
pd.DataFrame(**d)

Output

             perl                               C       
             line   line     func   func     func   func
version1  covered  total  covered  total  covered  total

Details

print(d)
#{'index': ['version1'], 'columns': MultiIndex([('perl', 'line_covered'),
#            ('perl',   'line_total'),
#            ('perl', 'func_covered'),
#            ('perl',   'func_total'),
#            (   'C', 'func_covered'),
#            (   'C',   'func_total')],
#           ), 'data': array([[207, 312,  15,  18,  10,  10]])}

you can see what** do

ansev
  • 30,322
  • 5
  • 17
  • 31
  • numpy library, you need import numpy as np, – ansev Mar 17 '20 at 20:49
  • I see. For some reason I get `ValueError: cannot reshape array of size 8 into shape (1,1)` on line `d['data'] = np.array(d['data']).reshape(1,len(data))` – vesii Mar 17 '20 at 20:58
  • upps sorry you need `.reshape(1,len(d['data'])` I use `data` and I forgot remove it. Solution update – ansev Mar 17 '20 at 21:01
  • Thanks! I tried to print `d` but I get `{'index': ['version1'], 'columns': MultiIndex(levels=[['C', 'perl'], ['func', 'line']], labels=[[1, 1, 1, 1, 0, 0, 0, 0], [0, 0, 1, 1, 0, 0, 1, 1]]), 'data': array([['covered', 'total', 'covered', 'total', 'covered', 'total', 'covered', 'total']], dtype=' – vesii Mar 17 '20 at 21:09
  • Your are welcome :), We use `**` in order to pass arguments to `pd.DataFrame` object, you can see : https://stackoverflow.com/questions/36901/what-does-double-star-asterisk-and-star-asterisk-do-for-parameters , you can check : `dict(**d) == d` is True – ansev Mar 17 '20 at 21:11
  • Yes, I understand that. But the result, does not show numbers at all. On the third line you get: `version covered total covered total covered total covered total` – vesii Mar 17 '20 at 21:13
  • but if you want to show the values ​​you must use the first code – ansev Mar 17 '20 at 21:15
  • Oh sorry! I though it was the same code. what is `columns` in the first code? – vesii Mar 17 '20 at 21:18
  • check `df.columns` where df is your dataframe – ansev Mar 17 '20 at 21:23
1

TL;DR:

df = pd.json_normalize(my_dict, sep='_')
df.columns = pd.MultiIndex.from_arrays(zip(*df.columns.str.split('_')))
df = df.stack(level=0).droplevel(0)

We can use pd.json_normalize. Since it joins the column names at different depths with .s, one can split the column names and create a pd.MultiIndex from the resulting tuples:

>>> df = pd.json_normalize(my_dict)
>>> tuple_cols = df.columns.str.split('.')
>>> df.columns = pd.MultiIndex.from_tuples(tuple(i) for i in tuple_cols)

We can also transpose with zip and use from_arrays:

>>> df = pd.json_normalize(my_dict)
>>> df.columns = pd.MultiIndex.from_arrays(zip(*df.columns.str.split('.')))

Either way, df becomes the following:

      version1                                                                                   
          perl                                               C                                   
  line_covered line_total func_covered func_total line_covered line_total func_covered func_total
0          207        312           15         18          321        512           10         10

Because you seem to consider your keys concatenated with _, we may use that too:

>>> df = pd.json_normalize(my_dict, sep='_')
>>> df.columns = pd.MultiIndex.from_arrays(zip(*df.columns.str.split('_')))
>>> df
  version1                                                
      perl                           C                    
      line          func          line          func      
   covered total covered total covered total covered total
0      207   312      15    18     321   512      10    10

But still, as far as I understand, the first level of your dictionary is actually the row name. So let's stack the first column level and drop the old index (I'm creating a second entry for clarity):

>>> my_dict['version2'] = my_dict['version1']
>>> df = pd.json_normalize(my_dict, sep='_')
>>> df.columns = pd.MultiIndex.from_arrays(zip(*df.columns.str.split('_')))
>>> df
  version1                                     version2                                    
      perl                     C                   perl                     C              
      line    func          line          func     line    func          line          func
     total covered total covered total covered    total covered total covered total covered
0      312      15    18     321   512      10      312      15    18     321   512      10
>>> df = df.stack(level=0).droplevel(0)

And you're good to go:

               C                  perl            
            func    line          func        line
         covered covered total covered total total
version1      10     321   512      15    18   312
version2      10     321   512      15    18   312

Tip: If your JSON doesn't have all levels occupied, i.e. the actual values are in different depths, you may consider using itertools.zip_longest:

from itertools import zip_longest

df = pd.json_normalize(my_dict)
tuple_cols = df.columns.str.split('.')
df.columns = pd.MultiIndex.from_arrays(zip_longest(*tuple_cols))