0

Assume I have a DataFrame df like:

source      tables      columns   data_type   length    RecordCount
src1        table1      col1      INT         4         71
src1        table1      col2      CHAR        2         71
src1        table2      col1      CHAR        2         43
src2        table1      col1      INT         4         21
src2        table1      col2      DATE        3         21

Need an output that looks similar to:

{
  "src1": {
    "table1": {
      "Record Count": 71 #missing in my current code output
      "col1": {
        "type": "INT"
        "length": 4
      },
      "col2": {
        "type": "CHAR"
        "length": 2
      }
    },
    "table2": {
      "Record Count": 43 #missing in my current code output
      "col1": {
        "type": "CHAR"
        "length": 2
      }
    }
  },
  "src2": {
    "table1": {
      "Record Count": 21 #missing in my current code output
      "col1": {
        "type": "INT"
        "length": 4
      },
      "col2": {
        "type": "DATE"
        "length": 3
      }
    }
  }
}

Current code:

def make_nested(df): 
    f = lambda: defaultdict(f)   
    data = f()  

    for row in df.to_numpy().tolist():
        t = data
        for index, r in enumerate(row[:-4]):
            t = t[r]
            if index == 1:
               t[row[-5]]: {
                  "Record Count": row[-1]
               }
        t[row[-4]] = {
            "type": row[-3],
            "length": row[-2]
        }

    return data
  • `for index, r in enumerate(row[:-4]):` is supposed to replace `for r in row[:-4]:`, not nest one inside the other. – cs95 Mar 02 '20 at 18:51
  • Made an edit to the code, looks like I'm getting the same original output without the new Record Count information added in the JSON file – weovibewvoibweoivwoiv Mar 02 '20 at 19:00

1 Answers1

1

Here is another solution use two steps of groupby method.

# First, groupby ['source','tables'] to deal with the annoying 'Record Count'
# Need python 3.5+
# Otherwise, another method to merge two dicts should be used 
df_new=df.groupby(['source','tables']).apply(lambda x: {**{'Record Count':x.iloc[0,-1]}, **{x.iloc[i,-4]: {'type':x.iloc[i,-3],'length':x.iloc[i,-2]} for i in range(len(x))}}).reset_index()

See Merge dicts

After the first step, the df_new is look like

    source  tables  0
0   src1    table1  {'Record Count': 71, 'col1': {'type': 'INT', 'length': 4}, 'col2': {'type': 'CHAR', 'length': 2}}
1   src1    table2  {'Record Count': 43, 'col1': {'type': 'CHAR', 'length': 2}}
2   src2    table1  {'Record Count': 21, 'col1': {'type': 'INT', 'length': 4}, 'col2': {'type': 'DATE', 'length': 3}}
# Second groupby
df_final = df_new.groupby('source').apply(lambda x: {x.iloc[i,-2]: x.iloc[i,-1] for i in range(len(x))})
output = df_final.to_json()

The output is an encoded string type of json file. To get the indented version

import json
temp = json.loads(output)
with open('somefile','w') as f:
    json.dump(temp,f,indent=4)
CameLion
  • 106
  • 3
  • Thanks, this works, however, when I dump this information to a file, it all appears in 1 line instead of like the intended spaced output I've shown in my post. How can I fix the code to allow for that? – weovibewvoibweoivwoiv Mar 02 '20 at 20:25
  • There's also slashes before every ```"``` which I'll need to get rid of – weovibewvoibweoivwoiv Mar 02 '20 at 20:26
  • 1
    @weovibewvoibweoivwoiv Add some content to change the format. Also, to directly fix your current code, try change `t[row[-5]]: {"Record Count": row[-1]}` to `t["Record Count"] = row[-1]` – CameLion Mar 03 '20 at 02:03