I am using the PyTorch Dataset class and DataLoader to load data. The class and loader look like the following.
    import json
    import torch
    from torch.utils.data import Dataset

    class Dataset(Dataset):  # note: shadows the imported base class
        def __init__(self):
            self.input_and_label = json.load(open(path_to_large_json_file))  # Large file
            self.dataset_size = len(self.input_and_label)

        def __getitem__(self, index):
            # Convert to tensors
            input_data = torch.LongTensor(self.input_and_label[index][0])  # Input X
            label_data = torch.LongTensor(self.input_and_label[index][1])  # Label y
            return input_data, label_data

        def __len__(self):
            return self.dataset_size
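One direction I was considering, to avoid holding everything in memory: convert the file to JSON Lines (one `[input, label]` pair per line) and index byte offsets up front, so `__getitem__` seeks to and parses a single record on demand. A minimal sketch; the JSON Lines path, class name, and record layout here are my assumptions, not my actual code:

```python
import json

import torch
from torch.utils.data import Dataset

class LazyJsonlDataset(Dataset):
    """Reads one [input, label] record per __getitem__ instead of the whole file."""

    def __init__(self, jsonl_path):
        self.jsonl_path = jsonl_path
        self.offsets = []  # byte offset of every line, so we can seek straight to it
        with open(jsonl_path, "rb") as f:
            while True:
                offset = f.tell()
                line = f.readline()
                if not line:
                    break
                self.offsets.append(offset)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, index):
        with open(self.jsonl_path, "rb") as f:
            f.seek(self.offsets[index])
            record = json.loads(f.readline())
        return torch.LongTensor(record[0]), torch.LongTensor(record[1])
```

Only the offset list stays resident; each worker re-opens the file and reads just the requested line, so the DataLoader below could be pointed at this class unchanged.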
And the DataLoader is created like this:
    train_loader = torch.utils.data.DataLoader(
        Dataset(),
        batch_size=8,      # This is expected to be large; 8 is for trial -- didn't work
        shuffle=True,
        pin_memory=False   # also tried True
    )
The data file is a large JSON file, but I am getting a memory error:

    RuntimeError: CUDA out of memory. Tried to allocate... ... ...
Note:
The large JSON file contains lists of numbers like,

    [0, 1, 0, 0, ..... 4000 numbers]  <-- this is the input_data
    [0, 2, 2, ... 50 numbers]         <-- this is the label

So a batch size of 8 (that is, 8 such pairs), or even 800, should not matter much.
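To put numbers on that: each pair is 4000 + 50 int64 values, so a batch should be tiny compared with typical GPU memory. A quick back-of-the-envelope check:

```python
BYTES_PER_LONG = 8           # torch.LongTensor elements are 64-bit integers
VALUES_PER_PAIR = 4000 + 50  # input length + label length

def batch_megabytes(batch_size):
    return batch_size * VALUES_PER_PAIR * BYTES_PER_LONG / 1e6

print(batch_megabytes(8))    # 0.2592 MB
print(batch_megabytes(800))  # 25.92 MB
```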
Can someone please help me: how can I get the iterator without loading the large file all at once? Any other solution is welcome too. Thank you very much for your support.