I am trying to make a customized DataLoader using PyTorch.
I've seen code like this (class definition omitted, sorry):
def __init__(self, data_root, transform=None, training=True, return_id=False):
    super().__init__()
    self.mode = 'train' if training else 'test'
    self.data_root = Path(data_root)
    csv_fname = 'train.csv' if training else 'sample_submission.csv'
    self.csv_file = pd.read_csv(self.data_root / csv_fname)
    self.transform = transform
    self.return_id = return_id

def __getitem__(self, index):
    """ TODO
    """

def __len__(self):
    """ TODO
    """
The problem here is that the data I've dealt with before had all the training data in one CSV file and all the testing data in another, two CSV files in total. (For example, in MNIST, the last column is the label and all the preceding columns are the features.)
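For that single-file setup, a minimal in-memory Dataset sketch might look like the following. This is an assumption about your CSV layout (last column = label, a header row present); the class name `CsvInMemoryDataset` is hypothetical:

```python
import pandas as pd
import torch
from torch.utils.data import Dataset

class CsvInMemoryDataset(Dataset):
    """Loads one whole CSV into RAM; assumes the last column is the label."""

    def __init__(self, csv_path):
        df = pd.read_csv(csv_path)
        # All columns except the last are features; the last is the label.
        self.features = torch.tensor(df.iloc[:, :-1].values, dtype=torch.float32)
        self.labels = torch.tensor(df.iloc[:, -1].values, dtype=torch.long)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, index):
        return self.features[index], self.labels[index]
```

This works when the whole file fits in RAM, which is exactly the assumption that breaks in the many-files case below.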
However, the problem I'm facing now is that I have very many (about 200,000) CSV files for training. Each one is smaller than the 60,000-sample MNIST, but still quite big, and they all contain different numbers of rows.
How can I make a customized class that inherits from torch.utils.data.Dataset? The MNIST dataset is quite small, so it can be loaded into RAM at once. However, the data I'm dealing with is very big, so I need some help.
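One common pattern for this situation is to index the files up front (a cheap row-count pass) and read each file lazily inside `__getitem__`, caching the most recently opened file. The sketch below assumes the same layout as above (header row, last column = label); the class name `ManyCsvDataset` and the glob pattern are hypothetical:

```python
import bisect
from pathlib import Path

import pandas as pd
import torch
from torch.utils.data import Dataset

class ManyCsvDataset(Dataset):
    """Indexes many CSV files up front, but reads each file lazily on access."""

    def __init__(self, data_root, pattern="*.csv"):
        self.files = sorted(Path(data_root).glob(pattern))
        # One cheap pass to count data rows per file (header excluded).
        counts = []
        for f in self.files:
            with open(f) as fh:
                counts.append(sum(1 for _ in fh) - 1)
        # Cumulative sums map a global sample index to (file, row-in-file).
        self.cumsum = []
        total = 0
        for c in counts:
            total += c
            self.cumsum.append(total)
        # Cache the most recently read file to avoid re-parsing it
        # for consecutive indices that land in the same file.
        self._cache_path = None
        self._cache_df = None

    def __len__(self):
        return self.cumsum[-1] if self.cumsum else 0

    def __getitem__(self, index):
        # Find which file the global index falls into.
        file_idx = bisect.bisect_right(self.cumsum, index)
        row = index - (self.cumsum[file_idx - 1] if file_idx else 0)
        path = self.files[file_idx]
        if path != self._cache_path:  # only re-read on a cache miss
            self._cache_df = pd.read_csv(path)
            self._cache_path = path
        r = self._cache_df.iloc[row]
        x = torch.tensor(r.iloc[:-1].values.astype("float32"))
        y = torch.tensor(int(r.iloc[-1]))
        return x, y
```

With this, only one file is ever held in memory at a time, and you can wrap it in a regular `DataLoader`. Note that with `num_workers > 0` each worker process gets its own copy of the dataset (and its own cache), and random shuffling will make the cache miss often; sorting or bucketing indices by file, or pre-converting the CSVs to a binary format, are common follow-up optimizations.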
Any ideas? Thank you in advance.