create valid json object in python

Question

Each line is valid JSON, but I need the file as a whole to be valid JSON.

I have some data that is aggregated from a web service and dumped to a file, so it's JSON-esque but not valid JSON, and it can't be processed in the simple and intuitive way that JSON files can, which is a major pain in the neck. It looks (more or less) like this:

{"record":"value0","block":"0x79"} 
{"record":"value1","block":"0x80"}

I've been trying to reinterpret it as valid JSON, my latest attempt looks like this:

with open('toy.json') as inpt:
    lines = []
    for line in inpt:
        if line.startswith('{'):  # block starts
            lines.append(line)

However, as you can likely deduce from the fact that I'm posing this question, that doesn't work. Any ideas about how I might tackle this problem?

EDIT:

Tried this:

with open('toy_two.json', 'rb') as inpt:

    lines = [json.loads(line) for line in inpt] 

print(lines['record'])

but got the following error:

Traceback (most recent call last):
  File "json-ifier.py", line 38, in <module>
    print(lines['record'])
TypeError: list indices must be integers, not str

Ideally I'd like to interact with it as I can with normal JSON, i.e. data['value']

EDIT II

with open('transactions000000000029.json', 'rb') as inpt:

    lines = [json.loads(line) for line in inpt]

    for line in lines: 
        records = [item['hash'] for item in lines]
    for item in records: 
        print item

Is each line valid JSON? eg: does `lines = [json.loads(line) for line in inpt]` do the job? — Jon Clements, Sep 16 '17 at 17:00
yes but I don't want to process each line- I want to process the file as a whole- the real one has millions of records — smatthewenglish, Sep 16 '17 at 17:04
In what way does `[json.loads(line) for line in inpt]` not constitute "processing the file as a whole" ? — Chris Martin, Sep 16 '17 at 17:08
@ChrisMartin when I gave it a shot I got this `print(lines['record']) TypeError: list indices must be integers, not str` — smatthewenglish, Sep 16 '17 at 17:09
I'm quite confused now. If this file *were* valid JSON, it would be a list, right? What type do you want to interpret it as? — Chris Martin, Sep 16 '17 at 17:10
I doubt you would want it to be JSON; it would consume gigabytes of RAM and had to be processed all at once, if some kind of iterative JSON module was not used... — Antti Haapala -- Слава Україні, Sep 16 '17 at 17:17
Does this answer your question? [multiple Json objects in one file extract by python](https://stackoverflow.com/questions/27907633/multiple-json-objects-in-one-file-extract-by-python) — TAbdiukov, Dec 11 '19 at 14:39

Stephane Martin · Answer 1 · 2017-09-16T17:23:36.860

2

Each line looks like a valid JSON document.

That's "JSON Lines" format (http://jsonlines.org/)

Try to process each line independently (json.loads(line)) or use a specialized library (https://jsonlines.readthedocs.io/en/latest/).

import json

def process(oneline):
    # do what you want with each parsed line (a dict)
    print(oneline['record'])

with open('toy_two.json', 'rb') as inpt:
    for line in inpt:
        process(json.loads(line))
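
The same line-by-line idea can be wrapped in a generator, so only one record is held in memory at a time. This is a minimal sketch, not part of the original answer: `iter_records` is a hypothetical helper name, the blank-line handling is an assumption about the input, and an in-memory `io.StringIO` stands in for the real file.

```python
import io
import json

def iter_records(lines):
    """Yield one parsed object per non-empty line of a JSON Lines source."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines, which are not valid JSON on their own
            yield json.loads(line)

# Stand-in for the real file; any iterable of lines works, including open(...).
sample = io.StringIO('{"record":"value0","block":"0x79"}\n'
                     '{"record":"value1","block":"0x80"}\n')

for rec in iter_records(sample):
    print(rec["record"])
```

Passing an open file object instead of `sample` should behave the same way, since a file iterates over its lines.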

edited Sep 16 '17 at 17:23

answered Sep 16 '17 at 17:03


I'd like to process the file as a whole- as the real one has millions of records – smatthewenglish Sep 16 '17 at 17:05
So ? You can just iterate on each line of the input file as you do in your code, and apply json.loads(line) inside the 'for' loop. – Stephane Martin Sep 16 '17 at 17:10
sounds expensive, I want to do it cheap and fast – smatthewenglish Sep 16 '17 at 17:12
If you store all parsed lines in a global list, yes this is going to be expensive in RAM. If you process each line independently, then you only use a bit of memory for the current line. That's "flow based programming". – Stephane Martin Sep 16 '17 at 17:17
ok cool- it was just the `data[0]['record']` issue- anyway- thank you for these great insights! – smatthewenglish Sep 16 '17 at 17:21

roganjosh · Accepted Answer · 2017-09-16T18:01:46.613

2

This looks like NDJSON that I've been working with recently. The specification is here and I'm not sure of its usefulness. Does the following work?

import json

with open('the file.json') as infile:
    # json.loads ignores the trailing newline, so no stripping is needed
    data = [json.loads(line) for line in infile]

This should give you a list of dictionaries.
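
To make the "list of dictionaries" shape concrete, here is a small sketch using the two sample lines from the question. The variable names `raw_lines`, `first`, `records`, and `by_record` are illustrative only, and the final lookup table assumes the `record` values are unique.

```python
import json

# The two sample lines from the question, as they would be read from the file.
raw_lines = [
    '{"record":"value0","block":"0x79"}\n',
    '{"record":"value1","block":"0x80"}\n',
]
data = [json.loads(line) for line in raw_lines]

# Index the list first, then the dictionary.
first = data[0]["record"]

# Collect one field from every record.
records = [item["record"] for item in data]

# Optional: build a lookup keyed by a field (assumes the values are unique).
by_record = {item["record"]: item for item in data}
```

Indexing with a string directly, as in `data['record']`, fails because `data` is a list; the string key only applies to the individual dictionaries inside it.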

edited Sep 16 '17 at 18:01

answered Sep 16 '17 at 17:08


when I tried it out just now I got this error `print(data['record']) TypeError: list indices must be integers, not str`, how can I verify that this works? – smatthewenglish Sep 16 '17 at 17:11
Because this parses the file and gives you a list of dictionaries, not a dictionary. – roganjosh Sep 16 '17 at 17:12
but I want to interact with it like I can with json, in normal json I can call things like `data['record']` you know what I mean? – smatthewenglish Sep 16 '17 at 17:13
@s.matthew.english You can still interact with it like you would normally. It's perfectly fine for a JSON response to contain lists. I really don't get the NDJSON format but it now exists, so it's a list of dicts. `data[0]['record']` should give you a result, and you should be able to iterate through the list to get the other results. – roganjosh Sep 16 '17 at 17:16
damn- I'm sorry it was exactly the `data[0]['record']`- thank you for your great help!~ :) – smatthewenglish Sep 16 '17 at 17:21
man- how can I iterate over all these records? `items()` isn't working – smatthewenglish Sep 16 '17 at 17:26
@s.matthew.english it's still a list, so `items()` is out. `records = [item['record'] for item in data]` should do it? I guess the point of the format is that every line is valid json, but the file as a whole is not. I find this a bit uncomfortable too, but you do just have a list of dictionaries so if you know how to iterate through lists and grab things by key, it's not that bad. – roganjosh Sep 16 '17 at 17:28
so this isn't it ` for line in lines: records = [item['record'] for item in lines] print(records)` but... do you have some idea? – smatthewenglish Sep 16 '17 at 17:31
No, drop `for line in lines:`. Right under the code I posted, just do `records = [item['record'] for item in data]`. There's no point in `print` in that loop because I gave you a list comprehension. After the list comp, you could do `for item in records: print item` if you choose. – roganjosh Sep 16 '17 at 17:34
so, I made **EDIT II** in the OP, popped out the printing part- it works for the toy file, but for the million records file- it just never finishes- maybe it's breaking or... do you have some idea? – smatthewenglish Sep 16 '17 at 17:41
so yeah- it works on the format- but maybe it's just- excruciatingly slow- do you have some idea on how to pump up the execution speed? – smatthewenglish Sep 16 '17 at 17:42
@s.matthew.english if you're talking about millions of lines then maybe this format comes into its own. You can perhaps read it in chunks, which is tough for a flat json file. – roganjosh Sep 16 '17 at 17:56
@s.matthew.english get rid of `print` as that's massively expensive. Also, `for line in lines: ` makes no sense since you're working on a list anyway. Get rid of it. – roganjosh Sep 16 '17 at 18:12
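
One way to read the file "in chunks", as suggested in the comments, is to pull a fixed number of lines at a time with `itertools.islice`. This is only a sketch under stated assumptions: `iter_chunks` and `chunk_size` are hypothetical names, the `hash` key is taken from EDIT II in the question, and an in-memory `io.StringIO` stands in for the large file.

```python
import io
import json
from itertools import islice

def iter_chunks(lines, chunk_size):
    """Yield lists of at most chunk_size parsed records."""
    it = iter(lines)
    while True:
        # islice consumes up to chunk_size lines from the shared iterator.
        chunk = [json.loads(line) for line in islice(it, chunk_size)]
        if not chunk:
            return  # input exhausted
        yield chunk

# Stand-in for the real file: five records with a "hash" field.
sample = io.StringIO("".join('{"hash": "h%d"}\n' % i for i in range(5)))

sizes = []
for chunk in iter_chunks(sample, 2):
    hashes = [item["hash"] for item in chunk]  # process one chunk, then drop it
    sizes.append(len(hashes))
```

Only one chunk is held in memory at a time, so memory use is bounded by `chunk_size` rather than by the size of the file.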
