0

I've cobbled together a few disparate excel/csv files with pandas for a database I am trying to build. I've seen a few examples here of creating nested Jsons from csvs, and while those have helped to partially replicate what I need, they ultimately have fallen short.

Rather that being flat, my data is stepwise like this where data is 'joined' by a subject id #, but information on individual visits and samples are on separate rows with 'NaN' for unrelated columns.

in csv format:

subject_id,name,dob,gender,visit_date,date_entered,entered_by,sample_id,collected_by,collection_date
1,Bob,1/1/00,M,,,,,,
1,,,,1/1/18,1/2/18,Sally,,,
1,,,,1/2/18,1/2/18,Tim,,,
1,,,,,,,XXX123,Sally,1/3/18
2,Mary,1/2/00,F,,,,,,
2,,,,1/3/18,1/4/18,Sally,,,
2,,,,,,,YYY456,Sally,1/5/18
2,,,,,,,ZZZ789,Tim,1/6/18

I'm trying to get an output like this:

{
'subject_id': '1'
'name': 'Bob',
'dob': '1/1/00',
'gender': 'M',
'visits': { 
    '1/1/18': {
        'date_entered': '1/2/18',
        'entered_by': 'Sally',
        }
    '1/2/18': {
        'date_entered': '1/2/18',
        'entered_by': 'Tim',
        }
    }
'samples': {
    'XXX123': {
        'collected_by': 'Sally',
        'collection_date': '1/3/18',
        }
    }
}
{
'subject_id': '2'
'name': 'Mary',
'dob': '1/2/00',
'gender': 'F',
'visits': { 
    '1/3/18': {
        'date_entered': '1/4/18',
        'entered_by': 'Sally',
        }
    }
'samples': {
    'YYY456': {
        'collected_by': 'Sally',
        'collection_date': '1/5/18',
        }
    'ZZZ789': {
        'collected_by': 'Tim',
        'collection_date': '1/6/18',
        }   
    }
}

Where information on visits and samples are nested under the more general information. This is obviously a simplified dataset of what I am trying to accomplish, but any advice would be greatly appreciated.

Thanks.

EDIT: More accurate reflection of csv data. Not as streamlined or complete as original example.

'subid,firstvisit,name,contact,dob,gender,visitdate1,age,visitcategory,samplenumber,label_on_sample,completed_by
    1,12/31/11,Bob,,12/31/00,Male,,,,,,
    1,,,,,,12/31/15,17,Baseline Visit,,,
    1,,,,,,12/31/16,18,Follow Up Visit,,,
    1,,,,,,12/31/17,18,Follow Up Visit,,,
    1,,,,12/31/00,Male,,17,,XXX123,1,Sally
    2,1/1/12,,,1/1/01,Female,,,,,,
    2,,,,,,1/1/11,10,Baseline Visit,,,
    2,,,,,,1/1/12,11,Follow Up Visit,,,
    2,,,,,,1/1/13,12,Follow Up Visit,,,
    2,,,,,,1/1/14,13,Follow Up Visit,,,
    2,,,,,,1/1/15,14,Follow Up Visit,,,
    2,,,,1/1/01,Female,,15,,YYY456,2,
    2,,,,1/1/01,Female,,15,,ZZZ789,2,Sally'

1 Answers1

0

Although I would guess the pandas wizards on SO have a different way, here is one way to achieve your example output with pure Python (I wrote this using Python 3.6.5).

Hopefully this can help you get started!


EDIT:

I modified the code to hopefully account for the new example csv data provided. Since the structure of the new csv is not exactly the same, I had to guess a bit as to the final output structure.

from collections import defaultdict
from csv import DictReader


def solution(csv_filename):
    by_subject_id = defaultdict(lambda: {
        'name': None,
        'dob': None,
        'gender': None,
        'visits': {},
        'samples': {}
    })

    with open(csv_filename) as f:
        dict_reader = DictReader(f)
        for row in dict_reader:
            non_empty = {k: v for k, v in row.items() if v}
            subject_id = non_empty['subid']  # must have to group by
            first_visit = non_empty.get('firstvisit')  # optional
            sample = non_empty.get('samplenumber')  # optional
            visit = non_empty.get('visitdate1')  # optional

            if first_visit:
                by_subject_id[subject_id].update({
                    'name': non_empty.get('name'),
                    'dob': non_empty.get('dob'),
                    'gender': non_empty.get('gender')
                })
            elif visit:
                by_subject_id[subject_id]['visits'][visit] = {
                    'age': non_empty.get('age'),
                    'visit_category': non_empty.get('visitcategory')
                }
            elif sample:
                by_subject_id[subject_id]['samples'][sample] = {
                    'completed_by': non_empty.get('completed_by'),
                    'label_on_sample': non_empty.get('label_on_sample')
                }
    return [{'subject_id': k, **v} for k, v in by_subject_id.items()]

Output:

[
    {
        "subject_id": "1",
        "name": "Bob",
        "dob": "12/31/00",
        "gender": "Male",
        "visits": {
            "12/31/15": {
                "age": "17",
                "visit_category": "Baseline Visit"
            },
            "12/31/16": {
                "age": "18",
                "visit_category": "Follow Up Visit"
            },
            "12/31/17": {
                "age": "18",
                "visit_category": "Follow Up Visit"
            }
        },
        "samples": {
            "XXX123": {
                "completed_by": "Sally",
                "label_on_sample": "1"
            }
        }
    },
    {
        "subject_id": "2",
        "name": null,
        "dob": "1/1/01",
        "gender": "Female",
        "visits": {
            "1/1/11": {
                "age": "10",
                "visit_category": "Baseline Visit"
            },
            "1/1/12": {
                "age": "11",
                "visit_category": "Follow Up Visit"
            },
            "1/1/13": {
                "age": "12",
                "visit_category": "Follow Up Visit"
            },
            "1/1/14": {
                "age": "13",
                "visit_category": "Follow Up Visit"
            },
            "1/1/15": {
                "age": "14",
                "visit_category": "Follow Up Visit"
            }
        },
        "samples": {
            "YYY456": {
                "completed_by": null,
                "label_on_sample": "2"
            },
            "ZZZ789": {
                "completed_by": "Sally",
                "label_on_sample": "2"
            }
        }
    }
]
G_M
  • 3,342
  • 1
  • 9
  • 23
  • Thanks so much. This works perfectly on the test data. Unfortunately, I am running into issues on the real set, largely, I believe, because of unexpected NaN's. For instance, Bob's 'dob' might be missing or a 'date_entered', and it seems to be throwing everything off. – jester_in_yellow May 24 '18 at 19:49
  • added a more accurate reflection of the data. Apologies for the inconvenience. – jester_in_yellow May 24 '18 at 20:31
  • 1
    This works like a charm thank you so much! I wanted to let you know you've helped a lot of people in a very meaningful way with this. – jester_in_yellow May 25 '18 at 16:53
  • Thanks again for all your help, been busy on trying to expand this to the larger dataset. One thing I noticed, is that it seems that this returns everything as a string. If I wanted ages, say, to be an int, is there some way to account for this? – jester_in_yellow Jun 01 '18 at 19:35
  • Ah, it seems to be an issue with reading in from the csv, it reads everything as a string. You wouldn't happen to know a good way to convert this to read from a dataframe would you? – jester_in_yellow Jun 01 '18 at 19:52
  • @jester_in_yellow `"If I wanted ages, say, to be an int, is there some way to account for this?"` https://docs.python.org/3/library/functions.html#int ? – G_M Jun 02 '18 at 00:03
  • I am running into issues querying due to the variable field names of the visits and the samples. I have been playing around with trying to correct for this, but haven't been able to find anything that doesn't breaks my code or only records one of the the instances for sample or visit. Do you know a way to modify this to record the subdocuments without a variable field name? – jester_in_yellow Jul 05 '18 at 14:59