
My script looks like this:

import json

with open('toy.json', 'rb') as inpt:
    lines = [json.loads(line) for line in inpt]
    for line in lines:
        records = [item['hash'] for item in lines]
    for item in records:
        print item

It reads data where each line is valid JSON but the file as a whole is not, because the file is an aggregated dump from a web service.

The data looks, more or less, like this:

{"record":"value0","block":"0x79"} 
{"record":"value1","block":"0x80"} 

The code above works, in that it lets me interact with the data as JSON, but it's so slow that it's essentially useless.

Is there a good way to speed up this process?

EDIT:

with open('toy.json', 'rb') as inpt:
    for line in inpt:
        print("identifier: "+json.loads(line)['identifier'])
        print("value:  "+json.loads(line)['value'])

EDIT II:

for line in inpt:
    resource = json.loads(line)
    print(resource['identifier']+", "+resource['value'])

2 Answers


You write:

for line in lines: 
    records = [item['hash'] for item in lines]

But this means that you will construct that records list n times (with n the number of lines). That work is wasted, and it makes the time complexity O(n²).
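To make the quadratic cost concrete, here is a small self-contained sketch (toy data with hypothetical hash values, not the asker's real file) that counts how often the inner list is rebuilt:

```python
import json

# toy data: four one-line JSON records with hypothetical hash values
lines = [json.loads('{"hash": "0x%02x"}' % i) for i in range(4)]

builds = 0
for line in lines:
    records = [item['hash'] for item in lines]  # rebuilt on every pass
    builds += 1

# the list comprehension ran len(lines) times, each pass scanning all lines,
# so the total work grows as n * n
print(builds * len(lines))
```

With 4 lines this does 16 item lookups; with a million lines it would do a trillion, which is why the original script crawls.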

You can speed this up with:

with open('toy.json', 'rb') as inpt:
    for item in [json.loads(line)['hash'] for line in inpt]:
        print item

Or you can reduce the memory burden by printing each hash as you process its line:

with open('toy.json', 'rb') as inpt:
    for line in inpt:
        print json.loads(line)['hash']
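The same streaming approach in Python 3 syntax, as a runnable sketch that writes two sample records to a temporary file first (the file contents here are illustrative, not the asker's data):

```python
import json
import tempfile

# write two sample records, one JSON object per line
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as tmp:
    tmp.write('{"hash": "0xabc", "block": "0x79"}\n')
    tmp.write('{"hash": "0xdef", "block": "0x80"}\n')
    path = tmp.name

# stream the file: parse and print one line at a time, so memory use
# stays constant no matter how large the dump is
with open(path) as inpt:
    for line in inpt:
        print(json.loads(line)['hash'])
```

Because each line is parsed exactly once and nothing is accumulated, this runs in O(n) time with O(1) memory.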
Willem Van Onsem
  • excellent. thank you for these insights. Can't accept the answer till 9 minutes later, but I will as soon as I can. – smatthewenglish Sep 16 '17 at 17:55
  • this `json.loads(line)['gas'] for line in inpt` is totally awesome- do you have some resource where I can learn about that? – smatthewenglish Sep 16 '17 at 17:57
  • in the edit I made some way of printing an identifier along with the value, do you think it's an ok approach? – smatthewenglish Sep 16 '17 at 18:03
  • @s.matthew.english: well here you parse the JSON code twice, furthermore the dictionaries have no `'identifier'` or `'value'` key. – Willem Van Onsem Sep 16 '17 at 18:04
  • @s.matthew.english: `json.loads(line)` returns a dictionary (like in your piece of code). You can simply chain operations. By writing `['hash']` we access the `hash` key of the dictionary, so we put that one in the list. – Willem Van Onsem Sep 16 '17 at 18:05
  • @s.matthew.english: see [list comprehension](http://www.secnetix.de/olli/Python/list_comprehensions.hawk). – Willem Van Onsem Sep 16 '17 at 18:05
  • cool- thank you for the resource, and also- I think I see what you mean, I put the new version in **EDIT II** – smatthewenglish Sep 16 '17 at 18:08
  • @s.matthew.english: but your data looks like `{"record":"value0","block":"0x79"}`. That means every dictionary has **two keys**: `record` and `block`. In case the dictionaries have keys like `identifier`, and `value`, your second edit is ok. – Willem Van Onsem Sep 16 '17 at 18:10
  • oh yeah- it has two keys, actually it has like 20 key values per line of JSON, but I'm only interested in two of them, it's more like one is a hash, and the value associated with that hash, the next is one characteristic, and the value of that characteristic, you know what I mean? – smatthewenglish Sep 16 '17 at 18:12
  • @s.matthew.english: yes, in that case it is fine. – Willem Van Onsem Sep 16 '17 at 18:13

If all you want to do is print, and you are dealing with massive files, you can split your file into n evenly sized chunks, where n is the number of cores in your CPU, and process the chunks in parallel.