
I've got JSON files with a total size of 3 GB. I need to parse some data from them into a Pandas DataFrame. I already made it a bit faster with a custom library for parsing JSON, but it is still too slow. It also works in only one thread, which is a problem. How can I make it faster? The main issue is that it starts at 60 it/s, but by the 50,000th iteration the speed drops to 5 it/s; RAM is still not fully used, so that is not the bottleneck. Here is an example of what I am doing:

import tqdm
import ujson  # used below but missing from the original snippet

with open('data/train.jsonlines') as fin:
    for line in tqdm.tqdm_notebook(fin):
        record = ujson.loads(line)
        for target in record['damage_targets']:
            # per-row .loc assignment into df_train (an existing DataFrame)
            df_train.loc[record['id'], 'target_{}'.format(target)] = record['damage_targets'][target]
keddad
  • I think one possible optimization is to not change the DataFrame on each iteration. Try working with a plain Python dictionary inside the loop, and then apply the results to your DataFrame (i.e. with `assign` or `concat`); see the sketch after the comments. – gseva Apr 09 '19 at 17:27
  • @gseva it might be a great idea! Thanks – keddad Apr 09 '19 at 17:30
  • Does the `json_normalize` not work for you? https://stackoverflow.com/a/21266043/8150685 – Error - Syntactical Remorse Apr 09 '19 at 22:03
  • @Error-SyntacticalRemorse The problem is not the speed of the JSON parser itself, but the continuous inserts into the DataFrame, as gseva already mentioned. The ujson library is pretty fast, so I don't really need to change it. – keddad Apr 10 '19 at 04:00
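
Following gseva's suggestion above, here is a minimal sketch of the dictionary-accumulation approach. It assumes each record has an 'id' field and a 'damage_targets' mapping, as in the question; the `rows` name is illustrative, not from the original code.

import ujson
import pandas as pd

rows = {}
with open('data/train.jsonlines') as fin:
    for line in fin:
        record = ujson.loads(line)
        # accumulate plain dicts instead of assigning into the DataFrame per row
        rows[record['id']] = {
            'target_{}'.format(target): value
            for target, value in record['damage_targets'].items()
        }

# build the DataFrame once, after the whole file has been read
df_train = pd.DataFrame.from_dict(rows, orient='index')

Building the frame once avoids the per-iteration reindexing and reallocation that `.loc` assignment triggers, which is the likely cause of the slowdown from 60 it/s to 5 it/s.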
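
For the single-thread concern raised in the question, one option (not discussed in the comments, so treat it as an assumption rather than the thread's conclusion) is to parse lines in a multiprocessing pool and assemble the DataFrame once in the parent process; `chunksize` here is an arbitrary tuning value.

import multiprocessing
import ujson
import pandas as pd

def parse_line(line):
    # parse one JSON line into (id, {column: value}) in a worker process
    record = ujson.loads(line)
    row = {'target_{}'.format(t): v for t, v in record['damage_targets'].items()}
    return record['id'], row

if __name__ == '__main__':
    with open('data/train.jsonlines') as fin, multiprocessing.Pool() as pool:
        rows = dict(pool.imap_unordered(parse_line, fin, chunksize=1000))
    df_train = pd.DataFrame.from_dict(rows, orient='index')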

0 Answers