0

I have a large .json file that contains one JSON object per line:

{"_id": "60ddad", "type": ["test"], "company": ["60dd888"], "answers": [], "info": {}, "createdAt": "2021-07-01T11:57:08.492Z","__v": 0}
{"_id": "60deb", "type": ["test"], "company": ["60dea"], "answers": [], "info": {}, "createdAt": "2021-07-02T07:07:27.436Z","__v": 0, "sentence": {}, "text": {}}
{"_id": "60debb2", "type": ["exam"], "company": ["60dea"], "answers": ["option1"], "info": {}, "createdAt": "2021-07-02T07:07:27.451Z", "__v": 0, "sentence": {}, "text": {}}

I am trying to remove the keys whose value is an empty struct (an empty object), such as "text": {}.

Is there a way to remove all the empty structs? A workaround would be to drop the specific keys that tend to hold empty structs, but occasionally they contain a non-empty struct.

I was thinking of:

import json

def empty_structs(dictionary):
    # TODO: return a copy of `dictionary` without the empty-struct values
    ...

with open('C:\\my\\path\\file.json', 'r', encoding="utf8") as handle:
    # One JSON object per line (NDJSON); blank lines are skipped
    dicts = [json.loads(line) for line in handle if line.strip()]

for d in dicts:
    new_d = empty_structs(d)
    json_string = json.dumps(new_d, ensure_ascii=False)
    print(json_string)

Expected output:

{"_id": "60ddad", "type": ["test"], "company": ["60dd888"], "answers": [], "createdAt": "2021-07-01T11:57:08.492Z","__v": 0}
{"_id": "60deb", "type": ["test"], "company": ["60dea"], "answers": [], "createdAt": "2021-07-02T07:07:27.436Z","__v": 0}
{"_id": "60debb2", "type": ["exam"], "company": ["60dea"], "answers": ["option1"], "createdAt": "2021-07-02T07:07:27.451Z", "__v": 0}
johnnydoe
  • Did you try checking for "text": {} and then deleting it? – Sarah Jul 06 '21 at 23:15
  • are you trying to eliminate strictly the nested objects/dicts, or would an empty list need to be removed as well? Please give us some examples for expected input/output. – David Culbreth Jul 06 '21 at 23:17
  • @DavidCulbreth I edited the question with the expected output based on the above input. I would eliminate just the nested objects/dicts. The actual reason why I wanna do this is because I want to upload the file to BigQuery and it doesn't support empty structs, so I guess empty lists aren't a problem. – johnnydoe Jul 06 '21 at 23:30
  • @Sarah I assume this is the solution in case I want to name the exact keys I want to delete, and I just iterate through every line and check if they are there? – johnnydoe Jul 06 '21 at 23:31
  • @johnnydoe I don't know much about Python, but I think you can get an answer from here: https://stackoverflow.com/questions/3845423/remove-empty-strings-from-a-list-of-strings – Sarah Jul 06 '21 at 23:36

2 Answers

1

Try this:

def empty_structs(d):
    return {k:v for k,v in d.items() if v}

This will also exclude every other falsy value, such as 0, "", None and [] (so "__v": 0 and "answers": [] would be dropped as well), so adjust as desired.
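One possible adjustment, as a minimal sketch: drop a value only when it is a dict and it is empty, so that 0 and [] survive. The name `drop_empty_dicts` is hypothetical and not part of the original answer, and only top-level keys are considered.

```python
def drop_empty_dicts(d):
    """Return a copy of d without top-level values that are empty dicts."""
    # Only {} is removed; 0, "", [] and None are kept because of the isinstance check.
    return {k: v for k, v in d.items() if not (isinstance(v, dict) and not v)}


row = {"_id": "60deb", "answers": [], "info": {}, "__v": 0, "text": {}}
print(drop_empty_dicts(row))
# {'_id': '60deb', 'answers': [], '__v': 0}
```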

dkamins
  • This will also eliminate the `'__v': 0` entries, which are wanted – Jiří Baum Jul 06 '21 at 23:46
  • @sabik In this case I don't mind it. The code given by OP works on the small sample that I gave in the original post, but it throws an error on the big file: JSONDecodeError: Unterminated string starting at: line 1 column 260 (char 259). (I tried deleting the first line and I get the exact same error and numbers). But the error happens in the json.loads line in my code. – johnnydoe Jul 07 '21 at 00:00
  • Sounds like the file itself has a problem; check it in a text editor, with particular attention to where the quotes start and end? – Jiří Baum Jul 07 '21 at 00:16
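To narrow down where such a file is broken, a small sketch like the following could report which lines fail to parse. It assumes each line is parsed on its own with json.loads; the helper name `find_bad_lines` and the in-memory sample are hypothetical, and the real file handle would be passed in instead.

```python
import io
import json


def find_bad_lines(handle):
    """Yield (line_number, error_message) for each line that is not valid JSON."""
    for lineno, line in enumerate(handle, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            yield lineno, str(exc)


# Stand-in for the real file: the second record is cut off mid-string.
sample = io.StringIO('{"_id": "60ddad", "__v": 0}\n{"_id": "60deb\n')
for lineno, message in find_bad_lines(sample):
    print(lineno, message)
```

If many lines are reported, one thing worth checking is whether a record contains a raw line break inside a string, which would split one JSON object across two lines.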
1

Try:

def empty_structs(d):
    return {k:v for k,v in d.items() if v != {}}

Note: an alternative approach would be to delete the entries directly in the original dict; however, this would have to be done in two loops, to avoid modifying it while iterating:

    to_remove = [k for k,v in d.items() if v == {}]
    for k in to_remove:
        del d[k]
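For a large file, the same filter can be applied while streaming line by line instead of reading everything into memory first. This is only a sketch under assumptions: `clean_ndjson` is a hypothetical helper, the in-memory buffers stand in for real file handles, and only top-level empty dicts are removed (nested ones are left as they are).

```python
import io
import json


def empty_structs(d):
    # Same filter as above: keep everything except values equal to {}.
    return {k: v for k, v in d.items() if v != {}}


def clean_ndjson(src, dst):
    """Copy NDJSON from src to dst, removing top-level empty dicts."""
    for line in src:
        if line.strip():  # skip blank lines
            cleaned = empty_structs(json.loads(line))
            dst.write(json.dumps(cleaned, ensure_ascii=False) + "\n")


# In-memory stand-ins for handles returned by open(..., encoding="utf8").
src = io.StringIO('{"_id": "60deb", "answers": [], "info": {}, "__v": 0}\n')
dst = io.StringIO()
clean_ndjson(src, dst)
print(dst.getvalue(), end="")
# {"_id": "60deb", "answers": [], "__v": 0}
```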
Jiří Baum