I am new to JSON and tried what has been proposed here, but I could not get it to work.

My original file (abbreviated) is called test.csv and looks like this:

person_uuid sample_uuid sample_slot sample_info
aa  AB  A   anything
aa  BD  B   more info
bc  FD  A   just info 
bc  AD  B   even more info 
bc  OI  C   text
hu  KL  B   texttext
hu  HF  C   information

The script I use for the conversion is called csv2json.py:

import csv
import json
import sys

base_name = sys.argv[1]
csvFilePath = "data/"+base_name+".csv"
jsonFilePath = "data/"+base_name+".json"

# https://stackoverflow.com/a/53474378/8584652
primary_fields = ['person_uuid']
secondary_fields = ['sample_slot']
result = []
with open(csvFilePath) as csv_file:
    reader = csv.DictReader(csv_file, delimiter='\t', skipinitialspace=True)
    for row in reader:
        d = {k: v for k, v in row.items() if k in primary_fields}
        e = {k: v for k, v in row.items() if k in secondary_fields}

        d['samples'] = [{k: v, }
                        for k, v in row.items() if k not in primary_fields]

        result.append(d)

# convert python jsonArray to JSON String and write to file
with open(jsonFilePath, 'w', encoding='utf-8') as jsonf:
    jsonString = json.dumps(result, indent=4)
    jsonf.write(jsonString)

I invoke the conversion with python csv2json.py test and get this result:

[
    {
        "person_uuid": "aa",
        "samples": [
            {
                "sample_uuid": "AB"
            },
            {
                "sample_slot": "A"
            },
            {
                "sample_info": "anything"
            }
        ]
    },
    {
        "person_uuid": "aa",
        "samples": [
            {
                "sample_uuid": "BD"
            },
            {
                "sample_slot": "B"
            },
            {
                "sample_info": "more info"
            }
        ]
    },
    {
        "person_uuid": "bc",
        "samples": [
            {
                "sample_uuid": "FD"
            },
            {
                "sample_slot": "A"
            },
            {
                "sample_info": "just info "
            }
        ]
    },
    {
        "person_uuid": "bc",
        "samples": [
            {
                "sample_uuid": "AD"
            },
            {
                "sample_slot": "B"
            },
            {
                "sample_info": "even more info "
            }
        ]
    },
    {
        "person_uuid": "bc",
        "samples": [
            {
                "sample_uuid": "OI"
            },
            {
                "sample_slot": "C"
            },
            {
                "sample_info": "text"
            }
        ]
    },
    {
        "person_uuid": "hu",
        "samples": [
            {
                "sample_uuid": "KL"
            },
            {
                "sample_slot": "B"
            },
            {
                "sample_info": "texttext"
            }
        ]
    },
    {
        "person_uuid": "hu",
        "samples": [
            {
                "sample_uuid": "HF"
            },
            {
                "sample_slot": "C"
            },
            {
                "sample_info": "information"
            }
        ]
    }
]

But I would like to get this instead:

[
    {
        "person_uuid": "aa",
        "samples": {
            "A": {
                "sample_uuid": "AB",
                "sample_info": "anything"
            },
            "B": {
                "sample_uuid": "BD",
                "sample_info": "more info"
            }
        }
    },
    {
        "person_uuid": "bc",
        "samples": {
            "A": {
                "sample_uuid": "FD",
                "sample_info": "just info"
            },
            "B": {
                "sample_uuid": "AD",
                "sample_info": "even more info"
            },
            "C": {
                "sample_uuid": "OI",
                "sample_info": "text"
            }
        }
    },
    {
        "person_uuid": "hu",
        "samples": {
            "B": {
                "sample_uuid": "KL",
                "sample_info": "texttext"
            },
            "C": {
                "sample_uuid": "HF",
                "sample_info": "information"
            }
        }
    }
]

Any help on how to nest this properly would be appreciated (this is what I was attempting with e = {k: v for k, v in row.items() if k in secondary_fields}).

lukascbossert

1 Answer

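This can be solved with itertools.groupby (also see this answer). Note that groupby only groups consecutive rows, so the file must already be ordered by person_uuid, as it is in the question.

Here is an example: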

import csv
from itertools import groupby

# csvFilePath is defined as in the question's script
primary_fields = "person_uuid"
secondary_fields = "sample_slot"

with open(csvFilePath) as csv_file:
    reader = csv.DictReader(csv_file, delimiter='\t', skipinitialspace=True)
    result = []
    # Group consecutive rows that share the same primary_fields value
    for key, group in groupby(reader, key=lambda x: x[primary_fields]):
        # Build the "samples" mapping from the rows of this group only
        samples = {
            elem[secondary_fields]: {
                "sample_uuid": elem["sample_uuid"],
                "sample_info": elem["sample_info"],
            }
            for elem in group
        }
        result.append({primary_fields: key, "samples": samples})

The result is:

[{'person_uuid': 'aa',
  'samples': {'A': {'sample_uuid': 'AB', 'sample_info': 'anything'},
   'B': {'sample_uuid': 'BD', 'sample_info': 'more info'}}},
 {'person_uuid': 'bc',
  'samples': {'A': {'sample_uuid': 'FD', 'sample_info': 'just info '},
   'B': {'sample_uuid': 'AD', 'sample_info': 'even more info '},
   'C': {'sample_uuid': 'OI', 'sample_info': 'text'}}},
 {'person_uuid': 'hu',
  'samples': {'B': {'sample_uuid': 'KL', 'sample_info': 'texttext'},
   'C': {'sample_uuid': 'HF', 'sample_info': 'information'}}}]
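
If the rows are not guaranteed to arrive ordered by `person_uuid`, they can be sorted before grouping. What follows is only a minimal sketch under a few assumptions: the function name `nest_rows` and the inline sample data are made up for illustration, every row is assumed to be complete, and values are stripped to drop the trailing spaces visible in the output above.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter


def nest_rows(rows, primary="person_uuid", secondary="sample_slot"):
    """Group flat rows by `primary`, keying each group's samples by `secondary`."""
    # groupby only merges adjacent rows, so sort by the grouping key first.
    ordered = sorted(rows, key=itemgetter(primary))
    result = []
    for key, group in groupby(ordered, key=itemgetter(primary)):
        samples = {
            row[secondary]: {
                # Keep every remaining column and drop surrounding whitespace.
                name: value.strip()
                for name, value in row.items()
                if name not in (primary, secondary)
            }
            for row in group
        }
        result.append({primary: key, "samples": samples})
    return result


# Hypothetical in-memory data standing in for data/test.csv (tab-separated),
# deliberately shuffled so that the sort matters.
sample_text = (
    "person_uuid\tsample_uuid\tsample_slot\tsample_info\n"
    "bc\tFD\tA\tjust info \n"
    "aa\tAB\tA\tanything\n"
    "bc\tAD\tB\teven more info \n"
    "aa\tBD\tB\tmore info\n"
)
reader = csv.DictReader(io.StringIO(sample_text), delimiter="\t")
nested = nest_rows(reader)
```

Because `sorted` is stable, rows of the same person keep their original relative order, so the slots appear in file order within each group.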
Lucas
  • Thank you a lot. A quick question. Where should I implement a sort command for the field `sample_slot`? Otherwise I can do it with an additional command `python -m json.tool --sort-keys`. – lukascbossert May 05 '21 at 14:44
  • Check the doc of [json.dumps](https://docs.python.org/3/library/json.html#json.dumps), see `sort_keys` param. Something like: `json.dumps(result, indent=4, sort_keys=True)` – Lucas May 05 '21 at 14:57
  • Also this [answer](https://stackoverflow.com/a/2774404/3954379) – Lucas May 05 '21 at 14:58
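
As a small illustration of the `sort_keys` suggestion: it sorts the keys of every object in the output, not only the slot names. The sketch below uses hand-written data (the `result` list is a made-up stand-in for the converted file) and also shows one possible way to order only the slots, which relies on dicts preserving insertion order (Python 3.7+).

```python
import json

# Made-up stand-in for the converted data, with slots deliberately out of order.
result = [
    {
        "person_uuid": "hu",
        "samples": {
            "C": {"sample_uuid": "HF", "sample_info": "information"},
            "B": {"sample_uuid": "KL", "sample_info": "texttext"},
        },
    }
]

# sort_keys=True orders the keys of every nested object alphabetically,
# so "sample_info" also moves in front of "sample_uuid".
json_string = json.dumps(result, indent=4, sort_keys=True)

# To order only the slots and leave all other keys untouched,
# rebuild each "samples" mapping in sorted key order before dumping.
for person in result:
    person["samples"] = dict(sorted(person["samples"].items()))
slots_only = json.dumps(result, indent=4)
```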