
I am running a loop that receives data and writes it to a JSON file. Here's what it looks like as a minimal reproducible example.

import json
import random
import string

dc_master = {}
for i in range(100):
    # Below mimics an API call that returns new data.
    name = ''.join(random.choice(string.ascii_uppercase) for _ in range(15))
    dc_info = {}
    dc_info['Height'] = 'NA'

    dc_master[name] = dc_info

    with open("myfile.json", "w") as filehandle:
        filehandle.write(json.dumps(dc_master))

As you can see from the above, every iteration creates a new dc_info, which becomes the value of a new key-value pair (with name as the key), and then the whole dc_master is written to the JSON file.

The one disadvantage of the above is that when it fails and I restart, I have to start from the very beginning. Should I open the JSON file for reading into dc_master, add a name: dc_info pair to the dictionary, and write dc_master back to the JSON file on every turn of the loop? Or should I just append to the JSON file, duplicates and all, and rely on the fact that when I need to use it, loading it back into a dictionary takes care of duplicates automatically?
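
For concreteness, the first option I have in mind would look roughly like this (add_entry is just a hypothetical helper name):

import json
import os

def add_entry(name, dc_info, path="myfile.json"):
    # Hypothetical helper: load whatever has been written so far,
    # add the new pair, and rewrite the whole file.
    dc_master = {}
    if os.path.isfile(path):
        with open(path) as filehandle:
            dc_master = json.load(filehandle)
    dc_master[name] = dc_info
    with open(path, "w") as filehandle:
        json.dump(dc_master, filehandle)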

Additional information: There are occasional timeouts, so I want to be able to restart somewhere in the middle if needed. The number of key-value pairs in dc_info is about 30, and the number of overall name: dc_info pairs is about 1000, so it's not huge. Reading it out and writing it back in again is not onerous, but I would like to know if there's a more efficient way of doing it.

Spinor8

2 Answers


I think you're fine and I'd loop over the whole thing and keep writing to the file, as it's cheap.

As for retries, you would have to check for a timeout and then see if the JSON file is already there, load it up, count your keys and then fetch the missing number of entries.
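
A rough sketch of that resume logic, keeping the dictionary layout from your question (fetch_one is just a stand-in for your real API call):

import json
import os
import random
import string

output_file = "myfile.json"
max_entries = 100

def fetch_one():
    # Stand-in for the real API call, mimicking the question's example.
    name = ''.join(random.choice(string.ascii_uppercase) for _ in range(15))
    return name, {"Height": "NA"}

# Load whatever was written before the failure, if anything.
dc_master = {}
if os.path.isfile(output_file):
    with open(output_file) as f:
        dc_master = json.load(f)

# Fetch only the missing number of entries and keep writing as before.
for _ in range(max_entries - len(dc_master)):
    name, dc_info = fetch_one()
    dc_master[name] = dc_info
    with open(output_file, "w") as f:
        json.dump(dc_master, f)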

Also, your example can be simplified a bit.

import json
import random
import string


dc_master = {}
for _ in range(100):
    name = ''.join(random.choice(string.ascii_uppercase) for _ in range(15))
    dc_master.update({name: {"Height": "NA"}})

with open("myfile.json", "w") as jf:
    json.dump(dc_master, jf, sort_keys=True, indent=4)

EDIT:

On second thought, you probably want to use a JSON list instead of a dictionary as the top level element, so it's easier to check how much you've got already.

import json
import os
import random
import string


output_file = "myfile.json"
max_entries = 100
dc_master = []


def do_your_stuff(data_container, n_entries=max_entries):
    for _ in range(n_entries):
        name = ''.join(random.choice(string.ascii_uppercase) for _ in range(15))
        data_container.append({name: {"Height": "NA"}})
    return data_container


def dump_data(data, file_name):
    with open(file_name, "w") as jf:
        json.dump(data, jf, sort_keys=True, indent=4)


if not os.path.isfile(output_file):
    dump_data(do_your_stuff(dc_master), output_file)
else:
    with open(output_file) as f:
        data = json.load(f)
        if len(data) < max_entries:
            new_entries = max_entries - len(data)
            dump_data(do_your_stuff(data, new_entries), output_file)
            print(f"Added {new_entries} entries.")
        else:
            print("Nothing to update.")
baduker
  • Isn't the same file rewritten 100 times from scratch? Maybe "a" mode plus a "\n" newline should be used? – Arty Sep 27 '20 at 09:37

I think the full script for fetching and storing API results should look like the example code below. At least, I always use the same kind of code for long-running sets of tasks.

I write each API call result as a separate single-line JSON object in the result file.

The script may be stopped in the middle, e.g. due to an exception; the file will be correctly closed and flushed thanks to the with context manager. Then, on restart, the script will read the already processed result lines from the file.

Only results that have not been processed already (i.e. whose id is not in processed_ids) will be fetched from the API. The id field may be anything that uniquely identifies each API call result.

Each new result is appended to the JSON-lines file thanks to 'a' (append) mode. buffering specifies the write-buffer size in bytes; the file is flushed and written out in blocks of this size, so the disk is not stressed by a sequence of tiny one-line writes. Using a large buffering value is totally alright because Python's with block correctly flushes and writes out all bytes whenever the block exits, due to an exception or any other reason, so you'll never lose even a single small result or byte that has already been written by f.write(...).

The final results are printed to the console.

Because your task is very interesting and important (at least, I have had similar tasks many times), I've decided to also implement a multi-threaded version of the single-threaded code below; it is especially useful when fetching data from the Internet, as it is usually necessary to download data in several parallel threads. The multi-threaded version can be found and run here and here. This multi-threading can also be extended to multi-processing for efficiency, using ideas from another answer of mine.
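
Roughly, the multi-threaded idea looks like the following sketch (fetch_one and the worker count are illustrative stand-ins, using concurrent.futures.ThreadPoolExecutor): worker threads fetch in parallel, while only the main thread writes results to the file, so no locking is needed.

import json, os, random, string
from concurrent.futures import ThreadPoolExecutor

fname = 'myfile.json'
enc = 'utf-8'

def fetch_one(id_):
    # Illustrative stand-in for the real API call for a single id.
    name = ''.join(random.choice(string.ascii_uppercase) for _ in range(15))
    return {'id': id_, 'name': name}

# Read ids that were already processed, same idea as in the code below.
processed_ids = set()
if os.path.exists(fname):
    with open(fname, 'r', encoding = enc) as f:
        processed_ids = {json.loads(line)['id'] for line in f if line.strip()}

todo = [id_ for id_ in range(100) if id_ not in processed_ids]

with open(fname, 'a', buffering = 1 << 20, encoding = enc) as f:
    with ThreadPoolExecutor(max_workers = 8) as pool:
        # Worker threads fetch in parallel; only this main thread writes.
        for result in pool.map(fetch_one, todo):
            f.write(json.dumps(result, ensure_ascii = False) + '\n')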

The single-threaded version of the code follows:

Try the code below online here!

import json, os, random, string

fname = 'myfile.json'
enc = 'utf-8'
id_field = 'id'

def ReadResults():
    results = []
    processed_ids = set()
    if os.path.exists(fname):
        with open(fname, 'r', encoding = enc) as f:
            data = f.read()
        results = [json.loads(line) for line in data.splitlines() if line.strip()]
        processed_ids = {r[id_field] for r in results}
    return (results, processed_ids)

# First read already processed elements
results, processed_ids = ReadResults()

with open(fname, 'a', buffering = 1 << 20, encoding = enc) as f:
    for id_ in range(100):
        # !!! Only process ids that are not in processed_ids !!!
        if id_ in processed_ids:
            continue        
        
        # Below mimics an API call that returns new data.
        # Should fetch only those objects that correspond to id_.
        name = ''.join(random.choice(string.ascii_uppercase) for _ in range(15))
        
        # Fill necessary result fields
        result = {}
        result['id'] = id_
        result['name'] = name
        result['field0'] = 'value0'
        result['field1'] = 'value1'
        
        cid = result[id_field] # There should be some unique id field
        
        assert cid not in processed_ids, f'Processed {cid} twice!'

        f.write(json.dumps(result, ensure_ascii = False) + '\n')
        
        results.append(result)
        processed_ids.add(cid)

print(ReadResults()[0])
Arty
  • Thanks for your answer. Just two questions. If I want the json file to be text readable, do I just use mode a instead of ab? Secondly, what is the point of buffering, and what exactly does "buffering = 1 << 20" mean? – Spinor8 Sep 27 '20 at 10:16
  • @Spinor8 The file actually holds only text results; I do the conversion to and from bytes using the `.encode('utf-8')` and `.decode('utf-8')` methods. Instead of this you may open the file with `a` and `r` modes instead of `ab` and `rb`, and provide the `encoding = 'utf-8'` param like this: `with open(fname, 'a', encoding = 'utf-8', buffering = 1 << 20):`. The meaning of buffering is as follows: it specifies the block size in bytes; the file is written out and flushed after each block of `buffering` bytes is ready, which just helps not to stress the disk with frequent small line writes. – Arty Sep 27 '20 at 10:42
  • @Spinor8 Using a big `buffering` value is totally alright when writing, because thanks to the `with` block Python will always close and flush the file whenever an exception occurs or the block finishes. So you'll never lose even a single small result, even with a big `buffering` value. – Arty Sep 27 '20 at 10:44
  • @Spinor8 I've just modified/corrected my code to open the file in text mode for both reading and writing, as you asked. – Arty Sep 27 '20 at 10:50
  • @Spinor8 Hope you understood my way of checking already processed tasks by ids (the `processed_ids` set). If you have nothing like ids in your case, then you may iterate through the `results` list before each API call to figure out somehow what has already been processed and what else needs to be fetched. – Arty Sep 27 '20 at 10:53
  • Yes, I understood the processed_ids. That piece of code is clever. Just rerunning the same script will start it from the point of failure. No need to count where it failed and manually enter the restart point. – Spinor8 Sep 27 '20 at 10:57
  • @Spinor8 Because your task is very interesting and important (at least, I have had similar tasks many times), I've decided to also implement a multi-threaded version of the single-threaded code in my answer; it is especially useful when fetching data from the Internet, as it is usually necessary to download data in several parallel threads. The multi-threaded version [can be found and run here](https://repl.it/@moytrage/StackOverflow64086700var2#main.py). The multi-threading can be extended to multi-processing by using [ideas from another answer of mine](https://stackoverflow.com/a/64080001/941531). – Arty Sep 27 '20 at 11:36
  • Very nice. I am actually downloading stuff from the internet. Unfortunately they throttle me if I send too many requests, so I never implemented it as a multiprocessing task. Also, don't you need some sort of lock when multiple processes are writing to the same file? – Spinor8 Sep 27 '20 at 13:22
  • @Spinor8 No, in my code there is no need for synchronization of reads/writes, for a simple reason: all results are gathered in the single main thread/process, and only this thread writes to the JSON file, hence there is no concurrency. Of course, if all child threads wrote to the same file then synchronization would be needed. – Arty Sep 27 '20 at 14:21