torch dataloader for large csv file - incremental loading

Question

I am trying to write a custom torch data loader so that large CSV files can be loaded incrementally (by chunks).

I have a rough idea of how to do that. However, I keep getting an error from PyTorch that I do not know how to solve.


import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

# Create dummy csv data
nb_samples = 110
a = np.arange(nb_samples)
df = pd.DataFrame(a, columns=['data'])
df.to_csv('data.csv', index=False)


# Create Dataset
class CSVDataset(Dataset):
    def __init__(self, path, chunksize, nb_samples):
        self.path = path
        self.chunksize = chunksize
        self.len = nb_samples / self.chunksize

    def __getitem__(self, index):
        x = next(
            pd.read_csv(
                self.path,
                skiprows=index * self.chunksize + 1,  #+1, since we skip the header
                chunksize=self.chunksize,
                names=['data']))
        x = torch.from_numpy(x.data.values)
        return x

    def __len__(self):
        return self.len


dataset = CSVDataset('data.csv', chunksize=10, nb_samples=nb_samples)
loader = DataLoader(dataset, batch_size=10, num_workers=1, shuffle=False)

for batch_idx, data in enumerate(loader):
    print('batch: {}\tdata: {}'.format(batch_idx, data))

Running this fails with: TypeError: 'float' object cannot be interpreted as an integer

score 2 · Answer 1 · answered Jan 02 '22 at 09:45

The error is caused by this line:

self.len = nb_samples / self.chunksize

In Python 3, dividing with / always produces a float, even when both operands are integers and the division is exact. However, len() only accepts an integer from __len__(). You therefore have to round self.len or convert it to an integer, for example with floor division:

self.len = nb_samples // self.chunksize

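The double slash (//) is floor division: it rounds down, and when both operands are integers the result is an integer too. With the values from the question, 110 // 10 gives 11.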
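
As a small illustration of the difference, the sketch below compares the two operators on the values from the question. The helper num_chunks and its drop_last flag are hypothetical names introduced here; they are not part of PyTorch or of the original code. The sketch assumes you may also want to count a final partial chunk when nb_samples is not a multiple of chunksize, which floor division alone would drop.

```python
import math


def num_chunks(nb_samples, chunksize, drop_last=True):
    """Hypothetical helper: number of chunks as an int, suitable for __len__."""
    if drop_last:
        # Floor division of two ints returns an int and ignores a partial chunk.
        return nb_samples // chunksize
    # math.ceil returns an int and counts a trailing partial chunk as well.
    return math.ceil(nb_samples / chunksize)


true_div = 110 / 10    # float, even though the division is exact
floor_div = 110 // 10  # int

print(type(true_div).__name__, true_div)
print(type(floor_div).__name__, floor_div)
print(num_chunks(115, 10), num_chunks(115, 10, drop_last=False))
```

Whether a partial chunk should be kept depends on how the chunks are consumed afterwards; in the question 110 is an exact multiple of 10, so both variants give the same length.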

Edit: You actually CAN define __len__() so that it returns a float, and calling dataset.__len__() directly works. The error only occurs when the built-in len(dataset) is called, so I guess len(dataset) is called somewhere inside the DataLoader class.
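
This behavior can be reproduced without PyTorch. The following is a minimal sketch; the class FloatLen is a made-up stand-in for the dataset, and the exact message text is the one CPython reports.

```python
class FloatLen:
    """Made-up stand-in for the dataset: __len__ returns a float."""

    def __len__(self):
        return 110 / 10  # true division, so this is the float 11.0


obj = FloatLen()

# Calling the method directly is an ordinary method call and succeeds.
direct = obj.__len__()

# The built-in len() checks the return type and rejects a float.
try:
    len(obj)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(direct)
print(error_message)
```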

Theodor Peifer

Many thanks for this suggestion. However, with this fix I get a new error: ```DataLoader worker (pid(s) 18357) exited unexpectedly``` – Petr Jan 02 '22 at 09:50
This error is unrelated, but maybe [this](https://stackoverflow.com/a/60101662/8909353) answer helps. – Theodor Peifer Jan 02 '22 at 12:25
