
I am using the PyTorch Dataset class and DataLoader to load data. The class and the loader look like the following.

import json
import torch
from torch.utils.data import Dataset


class Dataset(Dataset):
    def __init__(self):
        # Load the whole JSON file into memory up front
        self.input_and_label = json.load(open(path_to_large_json_file))  # Large file
        self.dataset_size    = len(self.input_and_label)

    def __getitem__(self, index):
        # Convert one (input, label) pair to tensors
        input_data = torch.LongTensor(self.input_and_label[index][0])  # Input X
        label_data = torch.LongTensor(self.input_and_label[index][1])  # Label y

        return input_data, label_data

    def __len__(self):
        return self.dataset_size

And the iterator is created like this:

train_loader = torch.utils.data.DataLoader(
    Dataset(),
    # Batch size
    batch_size = 8, # This is expected to be large, 8 is for trial -- didn't work
    shuffle = True,
    pin_memory = False #True 
)

The data file is a large JSON file, and I am getting a memory error:

<RuntimeError: CUDA out of memory. Tried to allocate... ... ... >

Note:

The content of the large JSON file is pairs of number lists, like:

    [0, 1 , 0, 0,..... 4000 numbers]  <-- this is the input_data
    [0, 2, 2, ... 50 numbers ]        <-- this is the label
So a batch size of 8 (that means 8 such pairs), or even 800, should not matter much.
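
As a rough back-of-envelope check (assuming int64 elements, which is what LongTensor produces; the numbers are only illustrative), a batch of 8 such pairs should take only a few hundred kilobytes:

bytes_per_pair = (4000 + 50) * 8      # ~32 KB per (input, label) pair as int64
batch_bytes    = 8 * bytes_per_pair   # ~253 KiB for a batch of 8
print(batch_bytes / 1024, "KiB")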

Can someone please help me: how can I get the iterator without loading the large file at once? Any other solution is welcome too. Thank you very much for your support.

Droid-Bird

2 Answers


You are getting a CUDA OOM error; it is not related to the file itself being large, but to a single example being large.

The JSON file loads into RAM correctly, but 8 examples cannot fit on your GPU (which is often the case for images/videos, especially at high resolution).
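
To see where the memory actually goes, a quick check of GPU memory can help (a minimal sketch using standard torch.cuda calls; cuda:0 assumes the first GPU):

import torch

device    = torch.device("cuda:0")
total     = torch.cuda.get_device_properties(device).total_memory
allocated = torch.cuda.memory_allocated(device)  # memory held by live tensors
reserved  = torch.cuda.memory_reserved(device)   # memory held by the caching allocator

print(f"total {total / 1e9:.2f} GB | allocated {allocated / 1e9:.2f} GB | reserved {reserved / 1e9:.2f} GB")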

Solutions

  1. Use a larger GPU (e.g. a cloud-provided one)
  2. Use a smaller batch size (even of size 1) together with gradient accumulation; a sketch follows this list. It will run slower, but the results will be as good as those of larger batches.
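
A minimal sketch of option 2 (gradient accumulation); model, criterion, optimizer and train_loader are placeholders for your own setup and are assumed to exist already:

accumulation_steps = 8  # effective batch size = batch_size * accumulation_steps

optimizer.zero_grad()
for step, (inputs, labels) in enumerate(train_loader):
    inputs, labels = inputs.cuda(), labels.cuda()
    loss = criterion(model(inputs), labels)
    (loss / accumulation_steps).backward()  # scale so the accumulated gradient matches a large batch

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()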
Szymon Maszke
  • Thank you @Szymon Maszke. I have just updated the question to provide more information. I think the batch size is not the issue here, as it's just a batch of arrays of numbers, not as big as images. – Droid-Bird May 27 '22 at 22:03
  • @Droid-Bird are you loading anything else? Maybe your model is already on CUDA by that time? How large is it if so? How much GPU RAM do you have? Did you try with a single example to see if it runs? – Szymon Maszke May 27 '22 at 22:15
  • Do you mean **a single batch being large**? An example will be equally large even if you decrease the batch size. – HelloGoodbye Nov 23 '22 at 14:18

The primary approach I had used was the Dataset(Dataset) class shown above. For large datasets, PyTorch provides an iterable-style dataset, IterableDataset.

Use the following:

import torch
from torch.utils.data import IterableDataset


class CustomIterableDataset(IterableDataset):

    def __init__(self, filename):
        # Store the filename in the object's memory
        self.filename = filename
        self.dataset_size = <Provide the file size here>

    def preprocess(self, text):
        ### Do something with the input here
        text_pp = some_processing_function(text)

        return text_pp

    def line_mapper(self, line):
        # Splits the line into input and label
        #     input, label = line.split(',')
        #     text = self.preprocess(text)

        # Each line holds one (input, label) pair of lists
        in_label, out_label = eval(line)

        input_s  = torch.LongTensor(in_label)
        output_s = torch.LongTensor(out_label)

        return input_s, output_s  # Input and output stream

    def __len__(self):
        return self.dataset_size

    def __iter__(self):
        # Open the file lazily; lines are read one at a time
        file_itr = open(self.filename)

        # Map each line through line_mapper
        mapped_itr = map(self.line_mapper, file_itr)

        return mapped_itr
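
One possible way to use it with a DataLoader (the path is a placeholder; note that shuffle is not supported for iterable-style datasets, and num_workers > 0 needs extra care to avoid reading the file once per worker):

dataset = CustomIterableDataset("path/to/large_file.txt")  # placeholder path

train_loader = torch.utils.data.DataLoader(
    dataset,
    batch_size = 8
    # shuffle = True is not supported for IterableDataset
)

for input_s, output_s in train_loader:
    ...  # training step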

For more information, read here

Droid-Bird