
I have 2 JSON files, data_large (150.1 MB) and data_small (7.5 KB). The content of each file is of the form [{"score": 68},{"score": 78}]. I need to find the list of unique scores from each file.

While dealing with data_small, I did the following and was able to view its content in 0.1 seconds.

import json

with open('data_small') as f:
    content = json.load(f)

print content # I'll be applying the logic to find the unique values later.

But while dealing with data_large, I did the following and my system hung and became slow; I had to force shut it down to bring it back to normal speed. It took around 2 minutes to print its content.

with open('data_large') as f:
    content = json.load(f)

print content # I'll be applying the logic to find the unique values later.

How can I increase the efficiency of the program when dealing with large data sets?

Lev Levitsky
python-coder
  • For large json files see: http://stackoverflow.com/questions/10382253/reading-rather-large-json-files-in-python/10382359#10382359 That answer suggests ijson – vinod Jan 04 '14 at 08:23
  • @vinod - Can't I do it with Python's built-in libraries? – python-coder Jan 04 '14 at 08:27
  • The `json` built-in lib loads the whole file at once. If you need to iterate over it, you will either have to parse the JSON file manually or install a lib like `ijson`. – miki725 Jan 04 '14 at 08:31
  • @python-coder Just comment out the `print` statement and execute your program with `data_large`. – thefourtheye Jan 04 '14 at 08:31
  • @thefourtheye - I commented out the `print` statement, but I again had to force shut down my system. God, you're going to end up corrupting my system. – python-coder Jan 04 '14 at 08:41
  • @python-coder alright, I put up an answer using the std libs. – vinod Jan 04 '14 at 08:56

2 Answers


Since your JSON file is not that large and you can afford to load it into RAM all at once, you can get all the unique values like this:

import json

with open('data_large') as f:
    content = json.load(f)

# do not print content: dumping it all to stdout is pretty slow

# get the unique values
values = set()
for item in content:
    values.add(item['score'])

# the loop above uses less memory than the one-liner below,
# since the list comprehension first builds another list with all the
# values and the set then filters it down to the unique ones
values = set([i['score'] for i in content])

# it's faster to save the results to a file than to print them
with open('results.json', 'wb') as fid:
    # json can't serialize sets, hence the conversion to a list
    json.dump(list(values), fid)

If you need to process even bigger files, look for libraries that can parse a JSON file iteratively, such as the `ijson` package mentioned in the comments.
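
For reference, here is a minimal streaming sketch of that approach, assuming the third-party `ijson` library is installed (`pip install ijson`) and that the file layout matches the sample in the question:

import json
import ijson  # third-party streaming JSON parser

unique_scores = set()

with open('data_large') as f:
    # ijson.items() yields one element of the top-level array at a time,
    # so the whole 150 MB file never has to sit in memory at once
    for item in ijson.items(f, 'item'):
        unique_scores.add(item['score'])

# write the results to a file instead of printing them to stdout
with open('results.json', 'wb') as out:
    json.dump(list(unique_scores), out)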

miki725
    Using a [_generator expression_](http://docs.python.org/2/reference/expressions.html?highlight=generator%20expression#generator-expressions) in the second method would avoid creating a temporary array -- `list` actually -- with all the values in it. Just use `values = set(i['score'] for i in content)`. – martineau Jan 04 '14 at 12:45
  • It took `201 secs` to print the unique values. Though `content = ijson.items(f, 'item')` loads quickly, `print set(i['score'] for i in content)` takes a long time. Can this be made more efficient? – python-coder Jan 04 '14 at 16:07
  • If there are many values to print, it will always take quite a bit of time... it's better to dump the results back into a file. – miki725 Jan 04 '14 at 19:40
  • @python-coder: Have you tried it with `set([i['score'] for i in content])`? Even though that creates a temporary list, it might be faster, because a generator expression trades execution time off against memory usage (see the sketch after this comment thread). On the other hand, it may not matter, because the bottleneck is most likely the printing of all those characters no matter how they're generated -- so miki725's suggestion of writing them to a file would be the fastest way to output the results, which has to be what you're ultimately trying to accomplish. – martineau Jan 09 '14 at 00:30
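
As a rough illustration of the trade-off martineau describes, here is a hedged micro-benchmark sketch; the synthetic `content` list below is a made-up stand-in for the real data, and actual timings will vary by machine:

import timeit

# synthetic stand-in for the loaded JSON content (assumption for illustration only)
content = [{"score": n % 100} for n in xrange(10 ** 6)]

gen_time = timeit.timeit(lambda: set(i['score'] for i in content), number=10)
lst_time = timeit.timeit(lambda: set([i['score'] for i in content]), number=10)

print 'generator expression: %.2f s' % gen_time
print 'list comprehension:   %.2f s' % lst_time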

If you want to iterate over the JSON file in smaller chunks to preserve RAM, I suggest the approach below, based on your comment that you did not want to use ijson to do this. This only works because your sample input data is so simple and consists of an array of dictionaries with one key and one value. It would get complicated with more complex data, and I would go with an actual JSON streaming library at that point.

import json

bytes_to_read = 10000
unique_scores = set()

with open('tmp.txt') as f:
    chunk = f.read(bytes_to_read)
    while chunk:
        # Find the indices of the dictionaries contained in this chunk
        if '{' not in chunk:
            break
        opening = chunk.index('{')
        ending = chunk.rindex('}')

        # Parse the complete dicts in this chunk and collect their scores.
        # Assumes bytes_to_read is large enough that a chunk containing a '{'
        # also contains a matching '}'.
        score_dicts = json.loads('[' + chunk[opening:ending+1] + ']')
        for s in score_dicts:
            unique_scores.add(s.values()[0])

        # Rewind to just past the last complete dict, so a dict that was cut
        # off at the chunk boundary is re-read whole in the next chunk.
        f.seek(-(len(chunk) - ending) + 1, 1)
        chunk = f.read(bytes_to_read)

print unique_scores
vinod
  • Well, I tried this, and it's still taking a long time to print the unique values. `f = open('data_large') content = ijson.items(f, 'item') print set(i['score'] for i in content)` – python-coder Jan 04 '14 at 15:43
  • It took `201 secs` to print the unique values. Though `content = ijson.items(f, 'item')` loads quickly, `print set(i['score'] for i in content)` takes a long time. Can this be made more efficient? – python-coder Jan 04 '14 at 15:47