I am trying to make a customized DataLoader using PyTorch.

I've seen some code like this (I've omitted the class, sorry):

def __init__(self, data_root, transform=None, training=True, return_id=False):
    super().__init__()
    self.mode = 'train' if training else 'test'

    self.data_root = Path(data_root)
    csv_fname = 'train.csv' if training else 'sample_submission.csv'
    self.csv_file = pd.read_csv(self.data_root / csv_fname)
    self.transform = transform
    self.return_id = return_id

def __getitem__(self, index):
    """ TODO
    """

def __len__(self):
    """ TODO
    """

The thing is, the datasets I've dealt with before had all the training data in one csv file and all the testing data in another, so 2 csv files in total. (For example, in a csv version of MNIST, the last column is the label and all the previous columns are the features.)

However, the problem I'm facing now is that I have a very large number of csv files for training (about 200,000). Each one is smaller than the 60,000-sample MNIST set, but still quite big, and they all contain different numbers of rows.

How can I make a customized class that inherits from torch.utils.data? The MNIST dataset is small enough to be loaded into RAM all at once, but the data I'm dealing with is far too big for that, so I need some help.

Any ideas? Thank you in advance.

piljae.chae

1 Answer

First, you want to customize (subclass) data.Dataset and not data.DataLoader; the stock data.DataLoader is perfectly fine for your use case.

What you can do, instead of loading all the data to RAM, is to read and store "meta data" in __init__ and open only the relevant csv file whenever you need to __getitem__ a specific entry.
A sketch of your Dataset would look something like this:

import bisect
from pathlib import Path
import pandas as pd
import torch.utils.data as data

class ManyCSVsDataset(data.Dataset):
    def __init__(self, csv_root):
        super(ManyCSVsDataset, self).__init__()
        # store the paths of all csvs and a running total of their row counts,
        # so a global index can be mapped back to (file, row within file)
        self.csv_paths = sorted(Path(csv_root).glob('*.csv'))
        self.cumulative_rows = []
        total = 0
        for path in self.csv_paths:
            with open(path) as f:
                total += sum(1 for _ in f) - 1  # rows, excluding the header
            self.cumulative_rows.append(total)
        self.num_items = total

    def __len__(self):
        return self.num_items

    def __getitem__(self, index):
        # based on the index, use self.cumulative_rows to pick the csv file,
        # then read only the single matching row from it
        file_idx = bisect.bisect_right(self.cumulative_rows, index)
        start = self.cumulative_rows[file_idx - 1] if file_idx > 0 else 0
        row = pd.read_csv(self.csv_paths[file_idx],
                          skiprows=range(1, index - start + 1), nrows=1)
        return row.values.squeeze()

This implementation is not efficient in the sense that it reads the same csv file over and over and does not cache anything. On the other hand, you can take advantage of data.DataLoader's multi-processing support to have many parallel sub-processes doing all this file access in the background while you actually use the data for training.
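
For example, here is a minimal sketch of how such a Dataset might be fed to a data.DataLoader with several worker processes; the folder path, batch size and worker count below are illustrative assumptions, not values from the question:

import torch.utils.data as data

# hypothetical path and hyper-parameters, adjust them for your setup
dataset = ManyCSVsDataset('/path/to/csv/folder')
loader = data.DataLoader(dataset,
                         batch_size=32,   # assumed batch size
                         shuffle=True,
                         num_workers=8)   # sub-processes reading csvs in parallel

for batch in loader:
    ...  # training step goes here; batches are assembled in the background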

Shai