Loading CSV into nested dictionary to pull out dynamic values

Question

I get a CSV from our developers that looks like this (example values only)

Category, Sub-Category, Template Name
Health Check,CPU,checkCPU
Health Check,Memory,checkMemory
Service Request,Reboot Device, rebootDevice
Service Request,Check CPU,checkCPU-SR

I need my python script to be able to read this (values will grow and change over time) and give me the Template Name for a given Category and Sub-Category. I can make this work by reading in the CSV and looping through it, searching for the values I want, but it seems like there has to be an easier way.

If I load a JSON file like this, I can use json.load to convert things to a dict, then easily retrieve the value I want without having to loop through things.

{
    "Health Check": {
        "CPU": "checkCPU",
        "Memory": "checkMemory"
    },
    "Service Request": {
        "Reboot Device": "rebootDevice",
        "Check CPU": "checkCPU-SR"
    }
}

Then I use something like

import json
import csv

with open('categories.json','r') as f:
    myDict = json.load(f)
    print(myDict["Health Check"]["CPU"])

I'd much rather use the dict method, but I don't know if there's a way to achieve this from a CSV file. I've tried a few things like csv.dictreader or Pandas, but I can't get either of them to work. Pandas I could setup a key, but none of the values here are unique. With csv.dictreader, the nested fashion of this data (multiple keys/values under a single heading like Health Check) don't seem to work.

Checkout https://stackoverflow.com/questions/43757965/convert-csv-to-json-tree-structure — Neo, Jul 20 '22 at 16:52
*"I get a CSV from our developers that looks like this"* - I would probably start by scheduling a working session with the aforementioned devs, just a quick chat really, trying to learn more about their thought process and also about their work experience with CSV files before, and hint if perhaps they can provide the same data in JSON format as well. — rv.kvetch, Jul 20 '22 at 16:57
what is wrong with using loop for this? Frankly I would use loop for this because it doesn't need Pandas. — furas, Jul 21 '22 at 08:54
if you could keep it as Pandas then you could get value without using `JSON` like `df[ (df["Category"] == "Health Check") & (df["Sub-Category"] == "CPU") ]["Template Name"]` — furas, Jul 21 '22 at 10:35

furas · Answer 1 · 2022-07-22T08:56:34.167

If you use pandas then you don't need loop and you don't have to convert to JSON.

all_results = df[ (df["Category"] == "Health Check") & (df["Sub-Category"] == "CPU") ]["Template Name"]

or more readable

mask = (df["Category"] == "Health Check") & (df["Sub-Category"] == "CPU")

all_results = df[mask]["Template Name"]

Miniamal working code.

I use io only to simulate file - so everyone can simply copy and test it - but you should use filename

text = '''Category, Sub-Category, Template Name
Health Check,CPU,checkCPU
Health Check,Memory,checkMemory
Service Request,Reboot Device, rebootDevice
Service Request,Check CPU,checkCPU-SR'''

import pandas as pd
import io

df = pd.read_csv(io.StringIO(text), sep=',\s*')  # I use `,\s*` to remove spaces after `,`
print(df)
print('---')

mask = (df["Category"] == "Health Check") & (df["Sub-Category"] == "CPU")

all_results = df[mask]["Template Name"]
print(all_results.iloc[0])

Result:

          Category   Sub-Category Template Name
0     Health Check            CPU      checkCPU
1     Health Check         Memory   checkMemory
2  Service Request  Reboot Device  rebootDevice
3  Service Request      Check CPU   checkCPU-SR
---
checkCPU

Using pandas you can easily select all items which can match some values - ie. all items which have substring CPU in Sub-Category

mask = df["Sub-Category"].str.contains("CPU")

all_results = df[mask]

for index, item in all_results.iterrows():
    print(item['Category'], '|', item["Sub-Category"], '|', item["Template Name"])

Result:

Health Check | CPU | checkCPU
Service Request | Check CPU | checkCPU-SR

Thanks for taking the time. I had tried looking into pandas, but I never heard of the mask feature. From what I had read, I thought you had to have unique values for your key, which I can't guarantee. It seems like your implementation doesn't have such a requirement though, right? Also, as it seems I failed to state it originally, this CSV will be 100+ lines, I only gave a few for sample purposes. I assume that won't cause any issues here? — ChadDa3mon, Jul 21 '22 at 21:11
frankly, pandas uses masks all time but nobody names it `mask` because in all examples you can see something like in first code - all in one line. With table you can have the same keys many times. If you will convert to JSON (dictionary) then you would have to unique keys - or if you would have the same key few times then you would get value only for last one. but pandas has also functions to select `unque` values or you can always use `[-1]` (instead of `[0]` to get last matching element. I don't have your data so you will have to test it on your own and create new question if you get problem — furas, Jul 22 '22 at 08:54
small correction: it needs `.iloc[0]` (or `.iloc[-1]`) instead of `[0]` in line `print(all_results[0])`. I changed it in code — furas, Jul 22 '22 at 08:57
to remove duplicated values [drop_duplicates](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html) — furas, Jul 22 '22 at 09:03

Michael · Answer 2 · 2022-07-20T17:57:08.750

import json


def convert():
    output_dict = {}
    with open("<file_name>", "r") as source:
        file_lines = source.read().splitlines()
        file_lines.pop(0)  # Remove header line as we don't need it
    for line in file_lines:
        line_contents = line.split(",")  # CSV so split the line on ','
        if line_contents[0] not in output_dict:  # If first key isn't in output_dict, make it as a dict
            output_dict[line_contents[0]] = {}
        if line_contents[1] not in output_dict[line_contents[0]]:  # If second key isn't a sub-key of the first-key in output_dict,  make it a dict
            output_dict[line_contents[0]][line_contents[1]] = {}
        output_dict[line_contents[0]][line_contents[1]][line_contents[2]] = line_contents[3]  # Now add the entry based on the top two key and sub-key

    return output_dict  # return the dict


if __name__ == "__main__":
    print(json.dumps(convert(), indent=4))

Loading CSV into nested dictionary to pull out dynamic values

2 Answers2