I have JSON Lines files totaling about 3 GB. I need to parse some data from them into a Pandas DataFrame. I already made it somewhat faster by using a custom library (ujson) to parse the JSON, but it is still too slow. It also runs in only one thread, which is a problem too. How can I make it faster? The main issue is that it starts at about 60 it/s, but by the 50,000th iteration the speed drops to 5 it/s. RAM is still not fully used, so memory is not the bottleneck. Here is an example of what I am doing:
import tqdm
import ujson
import pandas as pd

# df_train is a DataFrame created earlier, indexed by record id
with open('data/train.jsonlines') as fin:
    for line in tqdm.tqdm_notebook(fin):
        record = ujson.loads(line)
        for target in record['damage_targets']:
            df_train.loc[record['id'], 'target_{}'.format(target)] = record['damage_targets'][target]
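
For reference, I suspect the per-cell `df.loc[...]` assignment is what degrades over time, since it grows the DataFrame row by row. Below is a minimal sketch of an alternative I am considering, where rows are accumulated in a plain dict and the DataFrame is built once at the end (this assumes the extracted targets fit in memory; the file path and record fields are the same as above):

import tqdm
import ujson
import pandas as pd

rows = {}  # record id -> {column name: value}
with open('data/train.jsonlines') as fin:
    for line in tqdm.tqdm_notebook(fin):
        record = ujson.loads(line)
        rows[record['id']] = {
            'target_{}'.format(target): value
            for target, value in record['damage_targets'].items()
        }

# Build the DataFrame in one shot instead of growing it cell by cell
df_train = pd.DataFrame.from_dict(rows, orient='index')

This would only address the slowdown over iterations, though, not the single-thread limitation.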