
I have just started to learn Python and I have a task of converting a JSON file to a CSV file, with a semicolon as the delimiter and with three constraints. My JSON is:

{"_id": "5cfffc2dd866fc32fcfe9fcc", 
"tuple5": ["system1/folder", "system3/folder"], 
"tuple4": ["system1/folder/text3.txt", "system2/folder/text3.txt"], 
"tuple3": ["system2/folder/text2.txt"], 
"tuple2": ["system2/folder"], 
"tuple1": ["system1/folder/text1.txt", "system2/folder/text1.txt"], 
"tupleSize": 3}

The output CSV should be in a form:

system1                  ; system2                  ; system3
system1/folder           ; ~                        ; system3/folder
system1/folder/text3.txt ; system2/folder/text3.txt ; ~
~                        ; system2/folder/text2.txt ; ~
~                        ; system2/folder           ; ~
system1/folder/text1.txt ; system2/folder/text1.txt ; ~

So the three constraints are: the tupleSize will indicate the number of rows; the first part of each array element, i.e., system1, system2 and system3, will be the row headers; and only the elements belonging to a particular system will have values in the CSV file (the rest are ~).

I found a few posts regarding the conversion in Python, like this and this, but none of them dealt with constraints anything like these, and I am unable to figure out how to approach this.

Can someone help?

EDIT: I should mention that the array elements are dynamic and thus the row headers may vary in the CSV file.


1 Answer


What you want to do is fairly substantial, so if it's just a Python learning exercise, I suggest you begin with more elementary tasks.

I also think you've got what most folks call rows and columns reversed — so be warned that everything below, including the code, is using them in the opposite sense to the way you used them in your question.

Anyway, the code below first preprocesses the data to determine what the columns or fieldnames of the CSV file are going to be and to make sure there are the right number of them as specified by the 'tupleSize' key.

Assuming that constraint is met, it then iterates through the data a second time and extracts the column/field values from each key's value, putting them into a dictionary whose contents represent a row, which is then written to the output file.
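Most of the ~ placeholder handling comes for free from `csv.DictWriter`: its `restval` argument supplies the value written for any fieldname that's missing from a row dictionary. Here's a minimal sketch of just that mechanism (the filename `demo.csv` and the row contents are made up for illustration):

import csv

with open('demo.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, delimiter=';', restval='~',
                            fieldnames=['system1', 'system2'])
    writer.writeheader()
    # 'system2' is missing from this row, so it's written out as '~'.
    writer.writerow({'system1': 'system1/folder'})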

Updated

Modified to remove all keys that start with "_id" in the JSON object dictionary.

import csv
import json
import re


SEP = '/'  # Value sub-component separator.
id_regex = re.compile(r"_id\d*")  # Matches keys like '_id', '_id1', '_id2', ...
json_string = '''
    {"_id1": "5cfffc2dd866fc32fcfe9fc1",
     "_id2": "5cfffc2dd866fc32fcfe9fc2",
     "_id3": "5cfffc2dd866fc32fcfe9fc3",
     "tuple5": ["system1/folder", "system3/folder"],
     "tuple4": ["system1/folder/text3.txt", "system2/folder/text3.txt"],
     "tuple3": ["system2/folder/text2.txt"],
     "tuple2": ["system2/folder"],
     "tuple1": ["system1/folder/text1.txt", "system2/folder/text1.txt"],
     "tupleSize": 3}
'''

data = json.loads(json_string)  # Convert JSON string into a dictionary.

# Remove non-path items from dictionary.
tupleSize = data.pop('tupleSize')
# Pop and save any keys matching the '_id' pattern (values aren't used below).
_ids = {key: data.pop(key)
            for key in tuple(data.keys()) if id_regex.search(key)}
#print(f'_ids: {_ids}')
max_columns = int(tupleSize)  # Used to check a constraint below.

# Determine how many columns are present and what they are.
columns = set()
for key in data:
    paths = data[key]
    if not paths:
        raise RuntimeError('key with no paths')
    for path in paths:
        comps = path.split(SEP)
        if len(comps) < 2:
            raise RuntimeError('component with no subcomponents')
        columns.add(comps[0])

    if len(columns) > max_columns:
        raise RuntimeError('too many columns - conversion aborted')

# Create CSV file.
with open('converted_json.csv', 'w', newline='') as file:
    writer = csv.DictWriter(file, delimiter=';', restval='~',
                            fieldnames=sorted(columns))
    writer.writeheader()

    for key in data:
        row = {}
        for path in data[key]:
            column, *_ = path.split(SEP, maxsplit=1)
            row[column] = path
        writer.writerow(row)

print('Conversion complete')
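
With the sample JSON above, the generated converted_json.csv should come out as follows (columns sorted alphabetically, rows in the order the keys appear in the JSON, which dictionaries preserve in Python 3.7+):

system1;system2;system3
system1/folder;~;system3/folder
system1/folder/text3.txt;system2/folder/text3.txt;~
~;system2/folder/text2.txt;~
~;system2/folder;~
system1/folder/text1.txt;system2/folder/text1.txt;~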
  • I was trying to run it with some other examples, hence the delay in accepting the answer. This is exactly what I needed! – hw-135 Jun 13 '19 at 12:47
  • I am trying to understand each line of the code properly. I don't understand why both the question and the answer got downvoted though. – hw-135 Jun 13 '19 at 12:48
  • jackw: That's good to hear — and is what I hoped would be the case. When you earn enough reputation points to up-vote answers, I'd appreciate you doing that to this one. – martineau Jun 15 '19 at 14:47
  • Of course. I was wondering, is there a way to make the excluded keys dynamic in nature if my `_id` is dynamic in the sense there can be keys `_id1`, `_id2` and so on? – hw-135 Jun 17 '19 at 19:04
  • jackw: Yes, something like that is possible. It would require replacing the `_id, tupleSize = (data.pop(key) for key in ('_id', 'tupleSize'))` line with code that popped keys matching the pattern. – martineau Jun 17 '19 at 19:23
  • I was using the previous code and basically what I did was: `for key,value in data.items(): if key.startswith('_id'): excluded_keys.add(key)` This seems to give the desired output. – hw-135 Jun 18 '19 at 12:21
  • jackw: That looks basically all right, although you could simply use `data.keys()` instead of `data.items()`, since you don't care about the associated values. I have also updated my answer to show how to do it with regular expressions, although that is probably overkill for matching a pattern this simple. – martineau Jun 18 '19 at 12:40
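
For completeness, a minimal sketch of the `startswith`-based variant discussed in the comments above, assuming the same `data` dictionary as in the answer's code (the keys are snapshotted with `list()` because the dictionary is mutated while popping):

# Pop every key that starts with '_id' from the already-loaded dict.
excluded_keys = [key for key in list(data.keys()) if key.startswith('_id')]
for key in excluded_keys:
    data.pop(key)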