Each line is valid JSON, but I need the file as a whole to be valid JSON.

I have some data that is aggregated from a web service and dumped to a file, so it's JSON-esque but not valid JSON, and it can't be processed in the simple, intuitive way that JSON files can, which is a major pain in the neck. It looks (more or less) like this:

{"record":"value0","block":"0x79"} 
{"record":"value1","block":"0x80"} 

I've been trying to reinterpret it as valid JSON. My latest attempt looks like this:

with open('toy.json') as inpt:
    lines = []
    for line in inpt:
        if line.startswith('{'):  # block starts
            lines.append(line) 

However, as you can likely deduce from the fact that I'm posing this question, that doesn't work. Any ideas about how I might tackle this problem?

EDIT:

Tried this:

with open('toy_two.json', 'rb') as inpt:

    lines = [json.loads(line) for line in inpt] 

print(lines['record'])

but got the following error:

Traceback (most recent call last):
  File "json-ifier.py", line 38, in <module>
    print(lines['record'])
TypeError: list indices must be integers, not str

Ideally I'd like to interact with it as I can with normal JSON, i.e. data['value']

EDIT II

with open('transactions000000000029.json', 'rb') as inpt:

    lines = [json.loads(line) for line in inpt]

    for line in lines: 
        records = [item['hash'] for item in lines]
    for item in records: 
        print item
smatthewenglish

2 Answers

Each line looks like a valid JSON document.

That's "JSON Lines" format (http://jsonlines.org/)

Try to process each line independently (json.loads(line)) or use a specialized library (https://jsonlines.readthedocs.io/en/latest/).

import json

def process(oneline):
    # do what you want with each parsed line (a dict here)
    print(oneline['record'])

with open('toy_two.json', 'rb') as inpt:
    for line in inpt:
        process(json.loads(line))
Stephane Martin

This looks like NDJSON, which I've been working with recently. The specification is here, though I'm not sure of its usefulness. Does the following work?

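import json

with open('the file.json') as infile:
    # json.loads ignores the trailing newline, so no replace() is needed;
    # skip blank lines, which would otherwise raise an error
    data = [json.loads(line) for line in infile if line.strip()]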

This should give you a list of dictionaries.
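
If the goal from the title is a file that is valid JSON as a whole, one option is to write that list back out as a single JSON array. The following is only a minimal sketch: the function name `jsonl_to_array` and the output path are hypothetical, and it assumes the whole file fits in memory.

```python
import json

def jsonl_to_array(src_path, dst_path):
    """Read one JSON document per line and write them out as one JSON array.

    Hypothetical helper for illustration; assumes the file fits in memory.
    """
    with open(src_path) as src:
        # Skip blank lines so a trailing empty line does not raise a decode error.
        records = [json.loads(line) for line in src if line.strip()]
    with open(dst_path, "w") as dst:
        # The output file is a single valid JSON document (an array of objects).
        json.dump(records, dst, indent=2)
    return records
```

After that, `json.load` on the new file gives back the same list, so `data[0]['record']` works as usual.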

roganjosh
  • when I tried it out just now I got this error `print(data['record']) TypeError: list indices must be integers, not str`, how can I verify that this works? – smatthewenglish Sep 16 '17 at 17:11
  • Because this parses the file and gives you a list of dictionaries, not a dictionary. – roganjosh Sep 16 '17 at 17:12
  • but I want to interact with it like I can with json, in normal json I can call things like `data['record']` you know what I mean? – smatthewenglish Sep 16 '17 at 17:13
  • @s.matthew.english You can still interact with it like you would normally. It's perfectly fine for a JSON response to contain lists. I really don't get the NDJSON format but it now exists, so it's a list of dicts. `data[0]['record']` should give you a result, and you should be able to iterate through the list to get the other results. – roganjosh Sep 16 '17 at 17:16
  • damn- I'm sorry it was exactly the `data[0]['record']`- thank you for your great help!~ :) – smatthewenglish Sep 16 '17 at 17:21
  • man- how can I iterate over all these records? `items()` isn't working – smatthewenglish Sep 16 '17 at 17:26
  • @s.matthew.english it's still a list, so `items()` is out. `records = [item['record'] for item in data]` should do it? I guess the point of the format is that every line is valid json, but the file as a whole is not. I find this a bit uncomfortable too, but you do just have a list of dictionaries so if you know how to iterate through lists and grab things by key, it's not that bad. – roganjosh Sep 16 '17 at 17:28
  • so this isn't it ` for line in lines: records = [item['record'] for item in lines] print(records)` but... do you have some idea? – smatthewenglish Sep 16 '17 at 17:31
  • No, drop `for line in lines:`. Right under the code I posted, just do `records = [item['record'] for item in data]`. There's no point in `print` in that loop because I gave you a list comprehension. After the list comp, you could do `for item in records: print item` if you choose. – roganjosh Sep 16 '17 at 17:34
  • so, I made **EDIT II** in the OP, popped out the printing part- it works for the toy file, but for the million records file- it just never finishes- maybe it's breaking or... do you have some idea? – smatthewenglish Sep 16 '17 at 17:41
  • so yeah- it works on the format- but maybe it's just- excruciatingly slow- do you have some idea on how to pump up the execution speed? – smatthewenglish Sep 16 '17 at 17:42
  • @s.matthew.english if you're talking about millions of lines then maybe this format comes into its own. You can perhaps read it in chunks, which is tough for a flat json file. – roganjosh Sep 16 '17 at 17:56
  • @s.matthew.english get rid of `print` as that's massively expensive. Also, `for line in lines: ` makes no sense since you're working on a list anyway. Get rid of it. – roganjosh Sep 16 '17 at 18:12
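
For the large file discussed in the comments above, here is a rough sketch of reading the records lazily rather than building the whole list first. The names `iter_records` and `collect_field` are hypothetical, and whether this is actually faster depends on where the time is going; what it mainly avoids is holding every parsed record in memory at once.

```python
import json

def iter_records(path):
    """Yield one parsed record at a time (hypothetical helper)."""
    with open(path) as infile:
        for line in infile:
            # Blank lines are skipped; every other line is assumed to be valid JSON.
            if line.strip():
                yield json.loads(line)

def collect_field(path, key):
    """Gather a single field from every record without a redundant outer loop."""
    return [record[key] for record in iter_records(path)]
```

Something like `collect_field('transactions000000000029.json', 'hash')` would then return the list of hashes from EDIT II, with no `print` inside the loop.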