I have already done some preprocessing and feature selection, and I have pickled training input data that consists of lists of lists, e.g. (but pickled):
[[1,5,45,13], [23,256,4,2], [1,12,88,78], [-1]]
[[12,45,77,325], [23,257,5,28], [3,7,48,178], [12,77,89,99]]
[[13,22,78,89], [12,33,97], [-1], [-1]]
[-1] is a padding token, but I don't think that matters.
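For clarity, this is roughly how I assume the file was produced: each example is written with its own pickle.dump call, so the file holds many consecutive pickled objects (the file name below is just an example):

import pickle

# Hypothetical sketch of how the training file is assumed to be written:
# one pickle.dump per example, appended to the same file.
examples = [
    [[1, 5, 45, 13], [23, 256, 4, 2], [1, 12, 88, 78], [-1]],
    [[13, 22, 78, 89], [12, 33, 97], [-1], [-1]],
]

with open('train.pkl', 'wb') as fhout:
    for example in examples:
        pickle.dump(example, fhout)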
Because the file is several gigabytes in size, I wish to spare memory and use a generator to read in the pickle line by line (list by list). I already found this answer, which could be helpful. It would look as follows:
import pickle

def yield_from_pickle(pfin):
    with open(pfin, 'rb') as fhin:
        while True:
            try:
                # read the next pickled object from the file
                yield pickle.load(fhin)
            except EOFError:
                break
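For what it's worth, this is how I would use the generator on its own (the file name is just a placeholder):

# iterates over the file one example at a time, never loading it fully into memory
for example in yield_from_pickle('train.pkl'):
    print(example)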
The next thing is that I wish to use this data in a PyTorch (1.0.1) DataLoader. From what I found in other answers, I must feed it a Dataset, which I can subclass, but which must implement __len__ and __getitem__. It could look like this:
import pickle

from torch.utils.data import Dataset


class TextDataset(Dataset):
    def __init__(self, pfin):
        self.pfin = pfin

    def __len__(self):
        # memory-lenient way, but exhausts the generator?
        return sum(1 for _ in self.yield_from_pickle())

    def __getitem__(self, index):
        # ???
        pass

    def yield_from_pickle(self):
        with open(self.pfin, 'rb') as fhin:
            while True:
                try:
                    yield pickle.load(fhin)
                except EOFError:
                    break
But I am not at all sure whether this is even possible. How can I implement __len__ and __getitem__ in a sensible way? I don't think what I am doing with __len__ is a good idea, because that will exhaust the generator, and I have no idea how to safely implement __getitem__ while retaining the generator.
Is there a better way? To summarize, I want to build a Dataset that can be fed to PyTorch's DataLoader (because of its multiprocessing abilities) in a memory-efficient way, so that I don't have to read the whole file into memory.
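For context, this is roughly how I intend to consume the resulting Dataset once it works. The batch_size and num_workers values are placeholders, and I suspect a custom collate_fn would be needed for the variable-length examples:

from torch.utils.data import DataLoader

dataset = TextDataset('train.pkl')
# placeholder values; a custom collate_fn is probably needed because the
# examples have variable lengths
loader = DataLoader(dataset, batch_size=32, num_workers=4)

for batch in loader:
    ...  # training step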