I'm trying to get PyTorch to work with DataLoader, which is said to be the easiest way to handle mini-batches, and mini-batches are in some cases necessary for best performance.
DataLoader wants a Dataset as input.
Most of the documentation on Dataset assumes you are working with an off-the-shelf standard data set e.g. MNIST, or at least with images, and can use existing machinery as a black box. I'm working with non-image data I'm generating myself. My best current attempt to distill the documentation about how to do that, down to a minimal test case, is:
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader

class Dataset1(Dataset):
    def __init__(self):
        pass

    def __len__(self):
        return 80

    def __getitem__(self, i):
        # actual data is blank, just to test the mechanics of Dataset
        return [0.0, 0.0, 0.0], 1.0

train_dataloader = DataLoader(Dataset1(), batch_size=8)

for X, y in train_dataloader:
    print(f"X: {X}")
    print(f"y: {y.shape} {y.dtype} {y}")
    break

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(3, 10),
            nn.ReLU(),
            nn.Linear(10, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.layers(x)

device = torch.device("cpu")
model = Net().to(device)
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(10):
    for X, y in train_dataloader:
        X, y = X.to(device), y.to(device)
        pred = model(X)
        loss = criterion(pred, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
The output of the above program is:
X: [tensor([0., 0., 0., 0., 0., 0., 0., 0.], dtype=torch.float64), tensor([0., 0., 0., 0., 0., 0., 0., 0.], dtype=torch.float64), tensor([0., 0., 0., 0., 0., 0., 0., 0.], dtype=torch.float64)]
y: torch.Size([8]) torch.float64 tensor([1., 1., 1., 1., 1., 1., 1., 1.], dtype=torch.float64)
Traceback (most recent call last):
  File "C:\ml\test_dataloader.py", line 47, in <module>
    X, y = X.to(device), y.to(device)
AttributeError: 'list' object has no attribute 'to'
In all the example code I can find, X, y = X.to(device), y.to(device) succeeds, because X is indeed a tensor (whereas it is not in my version). Now I'm trying to find out what exactly converts X to a tensor, because either the example code, e.g. https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html, does not do so, or I am failing to understand how and where it does.
Does Dataset itself convert things to tensors? The answer seems to be 'sort of'.
It has converted y to a tensor, a column of the y value for every example in the batch. That much makes sense, though it has used type float64, whereas in machine learning we usually prefer float32. I am used to the idea that Python always represents scalars in double precision, so the conversion from double to single precision happens at the time of forming a tensor, and that this can be ensured by specifying the dtype parameter. But in this case Dataset seems to have formed the tensor implicitly. Is there a place or way to specify the dtype parameter?
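To make the dtype question concrete, here is a minimal sketch of one place where it could be specified: building the tensors explicitly inside __getitem__. The class name TensorDataset1 is hypothetical, and wrapping the label as a one-element tensor is an assumption made so that it lines up with the (batch, 1) output of the network above; this is a sketch, not necessarily the recommended approach.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class TensorDataset1(Dataset):  # hypothetical variant of Dataset1
    def __len__(self):
        return 80

    def __getitem__(self, i):
        # Form the tensors here, so dtype is chosen explicitly
        # instead of being inferred from Python floats later on.
        x = torch.tensor([0.0, 0.0, 0.0], dtype=torch.float32)
        y = torch.tensor([1.0], dtype=torch.float32)  # assumed shape (1,)
        return x, y


loader = DataLoader(TensorDataset1(), batch_size=8)
X, y = next(iter(loader))
print(X.shape, X.dtype)  # one 2D float32 tensor per batch
print(y.shape, y.dtype)
```

With samples that are already tensors, the batch comes out as a single stacked tensor per field, so X.to(device) would have something to call.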
X is not a tensor, but a list thereof. It would make intuitive sense if it were a list of the examples in the batch, but instead of a list of 8 elements each containing 3 elements, it's the other way around. So Dataset has transposed the input data, which would make sense if it is forming a tensor to match the shape of y, but instead of making a single 2d tensor, it has made a list of 1d tensors. (And, again, in double precision.) Why? Is there a way to change this behavior?
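The transposition can be reproduced without any Dataset at all by calling the default collate function directly. The sketch below assumes that DataLoader falls back to default_collate when no collate_fn is passed, and that it is importable from torch.utils.data.dataloader; the sample lists are made up to mirror the output shown above.

```python
import torch
from torch.utils.data.dataloader import default_collate

# Eight samples shaped like the return value of Dataset1.__getitem__.
list_samples = [([0.0, 0.0, 0.0], 1.0) for _ in range(8)]
X_list, y_list = default_collate(list_samples)
# Python lists are collated position by position, giving a list of
# three 1D tensors; Python floats become float64.
print(type(X_list), len(X_list), X_list[0].shape, X_list[0].dtype)

# The same samples, but with the features already held in tensors.
tensor_samples = [
    (torch.tensor([0.0, 0.0, 0.0]), torch.tensor(1.0)) for _ in range(8)
]
X_t, y_t = default_collate(tensor_samples)
# Tensors are stacked along a new leading batch dimension.
print(X_t.shape, X_t.dtype)
```

If that assumption holds, the list of 1d tensors comes from the batching step treating each sample's list as a sequence of separate fields, not from Dataset itself.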
The answer posted so far to "Does pytorch Dataset.__getitem__ have to return a dict?" says __getitem__ can return anything. Okay, but then how does the anything get converted to the form the training procedure requires?
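One candidate for where that conversion could be controlled is the collate_fn argument of DataLoader. The following is a minimal sketch under that assumption; ListDataset and collate_float32 are hypothetical names, and the (batch, 1) label shape is chosen only to match the output of the network above.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ListDataset(Dataset):  # hypothetical, mirrors Dataset1
    def __len__(self):
        return 80

    def __getitem__(self, i):
        return [0.0, 0.0, 0.0], 1.0


def collate_float32(batch):
    # batch is a list of (features, label) pairs, one per sample.
    xs = torch.tensor([features for features, _ in batch], dtype=torch.float32)
    ys = torch.tensor([[label] for _, label in batch], dtype=torch.float32)
    return xs, ys


loader = DataLoader(ListDataset(), batch_size=8, collate_fn=collate_float32)
X, y = next(iter(loader))
print(X.shape, X.dtype)
print(y.shape, y.dtype)
```

Here __getitem__ still returns plain Python values, and the function passed as collate_fn decides how a list of samples becomes a batch.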