I am working on a time series problem. The training time series data is stored in a large JSON file of about 30 GB. In TensorFlow I know how to use TFRecords. Is there a similar way in PyTorch?
- You are probably looking for data loaders, batching, and randomization. Look at `torch.utils.data.DataLoader` – BetaDev Mar 11 '19 at 23:23
- I checked. The documentation there basically covers handling small data sets. – Shamane Siriwardhana Mar 11 '19 at 23:56
- What's the schema of your json file? Is it complex? If not, I think you can use `JsonDataset` in `torchtext`. – cloudyyyyy Mar 19 '19 at 02:42
- It consists of time series data that also has NLP elements. – Shamane Siriwardhana Mar 19 '19 at 03:52
2 Answers
I suppose `IterableDataset` (docs) is what you need, because:
- you probably want to traverse the files without random access;
- the number of samples in the json files is not pre-computed.

I've made a minimal usage example with the assumption that every line of the dataset file is a json object itself, but you can change the logic.
```python
import json
from torch.utils.data import DataLoader, IterableDataset


class JsonDataset(IterableDataset):
    def __init__(self, files):
        self.files = files

    def __iter__(self):
        for json_file in self.files:
            with open(json_file) as f:
                for sample_line in f:
                    sample = json.loads(sample_line)
                    yield sample['x'], sample['time'], ...

...
dataset = JsonDataset(['data/1.json', 'data/2.json', ...])
dataloader = DataLoader(dataset, batch_size=32)

for batch in dataloader:
    y = model(batch)
```

roman
- A problem will arise with more workers: every worker will start reading from the same files, causing duplicates. – theo2021 Apr 08 '23 at 18:33
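One way to address that, sketched below under the same one-json-object-per-line assumption as the answer above: have each worker check `torch.utils.data.get_worker_info()` and read only its own slice of the file list. The sharding scheme shown here (every `num_workers`-th file per worker) is only an illustration, not part of the original answer.

```python
import json
from torch.utils.data import IterableDataset, get_worker_info


class ShardedJsonDataset(IterableDataset):
    def __init__(self, files):
        self.files = files

    def __iter__(self):
        worker = get_worker_info()
        if worker is None:
            files = self.files  # single-process loading: read every file
        else:
            # multi-process loading: worker i reads files i, i + num_workers, ...
            files = self.files[worker.id::worker.num_workers]
        for json_file in files:
            with open(json_file) as f:
                for line in f:
                    yield json.loads(line)  # adapt the field extraction to your schema
```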
Generally, you do not need to change/overload the default `data.DataLoader`.

What you should look into is how to create a custom `data.Dataset`.

Once you have your own `Dataset` that knows how to extract items one by one from the json file, you feed it to the "vanilla" `data.DataLoader`, and all the batching/multi-processing etc. is done for you based on the dataset you provide.

If, for example, you have a folder with several json files, each containing several examples, you can have a `Dataset` that looks like:
```python
import bisect
import json
import os

from torch.utils import data


class MyJsonsDataset(data.Dataset):
    def __init__(self, jfolder):
        super(MyJsonsDataset, self).__init__()
        self.filenames = []           # keep track of the json files you need to load
        self.cumulative_sizes = [0]   # number of examples seen so far
        # the original answer gave this loop as pseudo-code; this version assumes
        # each json file holds a top-level list of examples
        for jsonfile in sorted(os.listdir(jfolder)):
            if not jsonfile.endswith('.json'):
                continue
            path = os.path.join(jfolder, jsonfile)
            self.filenames.append(path)
            with open(path) as f:
                l = len(json.load(f))  # number of examples in this json file
            self.cumulative_sizes.append(self.cumulative_sizes[-1] + l)
        # discard the leading zero
        self.cumulative_sizes.pop(0)

    def __len__(self):
        return self.cumulative_sizes[-1]

    def __getitem__(self, idx):
        # first you need to know which of the files holds the idx-th example
        jfile_idx = bisect.bisect_right(self.cumulative_sizes, idx)
        if jfile_idx == 0:
            sample_idx = idx
        else:
            sample_idx = idx - self.cumulative_sizes[jfile_idx - 1]
        # now retrieve the `sample_idx`-th example from self.filenames[jfile_idx]
        with open(self.filenames[jfile_idx]) as f:
            retrieved_example = json.load(f)[sample_idx]
        return retrieved_example
```
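As a follow-up usage sketch (the folder path and loader settings below are placeholders, not from the original answer): once the `Dataset` above exists, the stock `data.DataLoader` takes care of shuffling, batching and worker processes; depending on what each example looks like, you may also need to pass a custom `collate_fn`.

```python
from torch.utils import data

dataset = MyJsonsDataset('data/jsons')  # hypothetical folder of json files
loader = data.DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)

for batch in loader:
    ...  # feed the batch to your model
```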

Shai