I am a beginner PyTorch user, and I am trying to use `DataLoader`.

I am trying to use it in my network, but loading the data takes a very long time. I debugged the network to see whether the network itself was the problem, but it turns out the problem has something to do with my dataloader class. Here is the code:

from torch.utils.data import Dataset, DataLoader
import numpy as np
import pandas as pd

class DiabetesDataset(Dataset):

    def __init__(self, csv):
        self.xy = pd.read_csv(csv)

    def __len__(self):
        return len(self.xy)

    def __getitem__(self, index):
        self.x_data = torch.Tensor(xy.iloc[:, 0:-1].values)
        self.y_data = torch.Tensor(xy.iloc[:, [-1]].values)
        return self.x_data[index], self.y_data[index]

dataset = DiabetesDataset("trial.csv")
train_loader = DataLoader(dataset=dataset,
                          batch_size=1,
                          shuffle=True,
                          num_workers=2)

for a in train_loader:
    print(a)

To verify that the dataloader causes the delay, I created a dummy CSV file with 2 columns, one of 1s and one of 2s, with 10 samples per column. Then I looped over the `train_loader` object. It has been running for more than 1 hour and has still not finished, even though the sample size is small and the batch size is set to 1.

I am not sure what in my code is wrong and causing this issue.

Any comments/inputs are greatly appreciated!

BVG

1 Answer

There are some bugs in your code. Could you check whether this works? It works on my computer with your toy example:

from torch.utils.data import Dataset, DataLoader
import numpy as np
import pandas as pd
import torch


class DiabetesDataset(Dataset):

    def __init__(self, csv):
        self.xy = pd.read_csv(csv)

    def __len__(self):
        return len(self.xy)

    def __getitem__(self, index):
        x_data = torch.Tensor(self.xy.iloc[:, 0:-1].values)
        y_data = torch.Tensor(self.xy.iloc[:, [-1]].values)
        return x_data[index], y_data[index]


dataset = DiabetesDataset("trial.csv")


train_loader = DataLoader(
    dataset=dataset,
    batch_size=1,
    shuffle=True,
    num_workers=2)

if __name__ == '__main__':
    for a in train_loader:
        print(a)

Edit: Your code is not working for two reasons. First, you are missing a `self` in the `__getitem__` method (it should be `self.xy.iloc...`). Second, you do not have an `if __name__ == '__main__':` guard at the end of your script, which matters because `num_workers=2` starts worker subprocesses. For the second error, see "RuntimeError on windows trying python multiprocessing".
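
For reference, here is a minimal sketch of the two fixes combined, with one optional change that is not part of the code above: the tensors are built once in `__init__` instead of on every `__getitem__` call. The names `CsvTensorDataset`, `make_toy_csv` and `main` are hypothetical, and the in-memory CSV only stands in for `trial.csv`. It defaults to `num_workers=0`; a larger value can be passed to `main` once the guard is in place.

```python
import io

import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader


class CsvTensorDataset(Dataset):
    """Hypothetical variant of DiabetesDataset: tensors are built once, not per item."""

    def __init__(self, csv):
        frame = pd.read_csv(csv)
        # Convert the whole table a single time; the last column is the label.
        self.x_data = torch.tensor(frame.iloc[:, :-1].values, dtype=torch.float32)
        self.y_data = torch.tensor(frame.iloc[:, [-1]].values, dtype=torch.float32)

    def __len__(self):
        return len(self.x_data)

    def __getitem__(self, index):
        # Note the explicit self: the attributes live on the instance.
        return self.x_data[index], self.y_data[index]


def make_toy_csv(rows=10):
    """Stand-in for trial.csv: a header plus rows of '1,2', kept in memory."""
    lines = ["a,b"] + ["1,2"] * rows
    return io.StringIO("\n".join(lines))


def main(num_workers=0):
    dataset = CsvTensorDataset(make_toy_csv())
    loader = DataLoader(dataset, batch_size=1, shuffle=True, num_workers=num_workers)
    return [batch for batch in loader]


# The guard keeps worker subprocesses from re-running the loop on import.
if __name__ == "__main__":
    for features, label in main():
        print(features, label)
```

Whether building the tensors up front is worthwhile depends on the size of the CSV; for the 10-row toy file it mainly avoids repeating the same conversion for every sample.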

Jonathan Guymont
  • Thanks for the quick response! I copied and pasted your code, but it is still running (5 minutes have elapsed). – BVG Feb 27 '19 at 05:14
  • Are you using the `if __name__ == '__main__'` at the end? – Jonathan Guymont Feb 27 '19 at 05:23
  • Could you try setting `num_workers=0` in the `DataLoader`? – Jonathan Guymont Feb 27 '19 at 05:38
  • it worked!! thank you very much. Would you know if this has something to do with the computing power/specs of my laptop? – BVG Feb 27 '19 at 05:45
  • I am really not an expert. You should read about subprocesses. There is something wrong if you can't use 2 subprocesses to load the data, but I cannot tell you the source. You could start by looking at your task manager to check your CPU usage. – Jonathan Guymont Feb 27 '19 at 05:59
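
Following up on the `num_workers=0` suggestion in the comments, one way to narrow this down is to time a full pass over a tiny dataset with and without worker subprocesses. This is only a rough sketch: `time_one_pass` and `make_toy_dataset` are hypothetical helper names, and the `TensorDataset` stands in for the 10-row toy CSV. It does not explain why the workers stall on a particular machine; it only shows whether they do.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset


def time_one_pass(dataset, num_workers):
    """Return (number of batches, seconds) for one full pass over the dataset."""
    loader = DataLoader(dataset, batch_size=1, shuffle=True, num_workers=num_workers)
    start = time.perf_counter()
    n_batches = sum(1 for _ in loader)
    return n_batches, time.perf_counter() - start


def make_toy_dataset(rows=10):
    # Stand-in for the toy CSV: a column of 1s and a column of 2s.
    return TensorDataset(torch.ones(rows, 1), torch.full((rows, 1), 2.0))


# Worker subprocesses are only started from inside the guard.
if __name__ == "__main__":
    dataset = make_toy_dataset()
    for workers in (0, 2):
        n_batches, seconds = time_one_pass(dataset, workers)
        print(f"num_workers={workers}: {n_batches} batches in {seconds:.3f}s")
```

If the `num_workers=0` pass finishes immediately and the `num_workers=2` pass does not, the problem is likely in how subprocesses are started in that environment rather than in the dataset code.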