
My script looks like this:

import json

with open('toy.json', 'rb') as inpt:
    lines = [json.loads(line) for line in inpt]
    for line in lines:
        records = [item['hash'] for item in lines]
    for item in records:
        print item

It reads data where each line is valid JSON but the file as a whole is not, because the file is an aggregated dump from a web service.

The data looks, more or less, like this:

{"record":"value0","block":"0x79"} 
{"record":"value1","block":"0x80"} 

The code above works, in that it lets me interact with the data as JSON, but it's so slow that it's essentially useless.

Is there a good way to speed up this process?

EDIT:

with open('toy.json', 'rb') as inpt:
    for line in inpt:
        print("identifier: "+json.loads(line)['identifier'])
        print("value:  "+json.loads(line)['value'])

EDIT II:

for line in inpt:
    resource = json.loads(line)
    print(resource['identifier']+", "+resource['value'])

2 Answers


You write:

for line in lines: 
    records = [item['hash'] for item in lines]

But this means that you will construct that records list n times (with n the number of lines). That work is wasted, and it makes the time complexity O(n²).
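To make the quadratic cost concrete, here is a small self-contained sketch (toy data with hypothetical hash values, not the asker's real file) that counts how often the inner list is rebuilt:

```python
import json

# toy data: four one-line JSON records with hypothetical hash values
lines = [json.loads('{"hash": "0x%02x"}' % i) for i in range(4)]

builds = 0
for line in lines:
    records = [item['hash'] for item in lines]  # rebuilt on every pass
    builds += 1

# the list comprehension ran len(lines) times, each pass scanning all lines,
# so the total work grows as n * n
print(builds * len(lines))
```

With 4 lines this does 16 item lookups; with a million lines it would do a trillion, which is why the original script crawls.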

You can speed this up with:

with open('toy.json', 'rb') as inpt:
    for item in [json.loads(line)['hash'] for line in inpt]:
        print item

Or you can reduce the memory burden by printing each hash as you process its line:

with open('toy.json', 'rb') as inpt:
    for line in inpt:
        print json.loads(line)['hash']
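The same streaming approach in Python 3 syntax, as a runnable sketch that writes two sample records to a temporary file first (the file contents here are illustrative, not the asker's data):

```python
import json
import tempfile

# write two sample records, one JSON object per line
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as tmp:
    tmp.write('{"hash": "0xabc", "block": "0x79"}\n')
    tmp.write('{"hash": "0xdef", "block": "0x80"}\n')
    path = tmp.name

# stream the file: parse and print one line at a time, so memory use
# stays constant no matter how large the dump is
with open(path) as inpt:
    for line in inpt:
        print(json.loads(line)['hash'])
```

Because each line is parsed exactly once and nothing is accumulated, this runs in O(n) time with O(1) memory.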
Willem Van Onsem
  • excellent. thank you for these insights. Can't accept the answer till 9 minutes later, but I will as soon as I can. – smatthewenglish Sep 16 '17 at 17:55
  • this `json.loads(line)['gas'] for line in inpt` is totally awesome- do you have some resource where I can learn about that? – smatthewenglish Sep 16 '17 at 17:57
  • in the edit I made some way of printing an identifier along with the value, do you think it's an ok approach? – smatthewenglish Sep 16 '17 at 18:03
  • @s.matthew.english: well here you parse the JSON code twice, furthermore the dictionaries have no `'identifier'` or `'value'` key. – Willem Van Onsem Sep 16 '17 at 18:04
  • @s.matthew.english: `json.loads(line)` returns a dictionary (like in your piece of code). You can simply chain operations. By writing `['hash']` we access the `hash` key of the dictionary, so we put that one in the list. – Willem Van Onsem Sep 16 '17 at 18:05
  • @s.matthew.english: see [list comprehension](http://www.secnetix.de/olli/Python/list_comprehensions.hawk). – Willem Van Onsem Sep 16 '17 at 18:05
  • cool- thank you for the resource, and also- I think I see what you mean, I put the new version in **EDIT II** – smatthewenglish Sep 16 '17 at 18:08
  • @s.matthew.english: but your data looks like `{"record":"value0","block":"0x79"}`. That means every dictionary has **two keys**: `record` and `block`. In case the dictionaries have keys like `identifier`, and `value`, your second edit is ok. – Willem Van Onsem Sep 16 '17 at 18:10
  • oh yeah- it has two keys, actually it has like 20 key values per line of JSON, but I'm only interested in two of them, it's more like one is a hash, and the value associated with that hash, the next is one characteristic, and the value of that characteristic, you know what I mean? – smatthewenglish Sep 16 '17 at 18:12
  • @s.matthew.english: yes, in that case it is fine. – Willem Van Onsem Sep 16 '17 at 18:13

If all you want to do is print, and you are dealing with massive files, you can split your file into n evenly sized chunks, where n is the number of cores in your CPU, and process the chunks in parallel.