
I have done some preprocessing and feature selection already, and I have a pickled training-data file that consists of lists of lists, e.g. (but pickled)

[[1,5,45,13], [23,256,4,2], [1,12,88,78], [-1]]
[[12,45,77,325], [23,257,5,28], [3,7,48,178], [12,77,89,99]]
[[13,22,78,89], [12,33,97], [-1], [-1]]

[-1] is a padding token, but I don't think that matters.
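
For context, a file like this can be produced with repeated pickle.dump calls, one sample at a time. A minimal sketch (train.pkl is a placeholder name I use for illustration):

import pickle

samples = [
    [[1, 5, 45, 13], [23, 256, 4, 2], [1, 12, 88, 78], [-1]],
    [[12, 45, 77, 325], [23, 257, 5, 28], [3, 7, 48, 178], [12, 77, 89, 99]],
    [[13, 22, 78, 89], [12, 33, 97], [-1], [-1]],
]

# Write one sample per dump call so the file can later be read back
# one object at a time.
with open('train.pkl', 'wb') as fhout:
    for sample in samples:
        pickle.dump(sample, fhout)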

Because the file is many gigabytes in size, I wish to spare memory and use a generator to read the pickle sample by sample (list by list). I already found this answer that could be helpful. That would look as follows:

import pickle

def yield_from_pickle(pfin):
    with open(pfin, 'rb') as fhin:
        while True:
            try:
                # Each pickle.load call reads exactly one pickled object
                yield pickle.load(fhin)
            except EOFError:
                break
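
Consuming the generator then keeps only one sample in memory at a time (using the placeholder file name from above):

for sample in yield_from_pickle('train.pkl'):
    print(sample)  # one list of lists per iteration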

The next thing is that I wish to use this data in a PyTorch (1.0.1) DataLoader. From what I found in other answers, I must feed it a Dataset subclass, which must implement __len__ and __getitem__. It could look like this:

import pickle

from torch.utils.data import Dataset

class TextDataset(Dataset):
    def __init__(self, pfin):
        self.pfin = pfin

    def __len__(self):
        # memory-lenient, but requires a full pass over the file?
        return sum(1 for _ in self.yield_from_pickle())

    def __getitem__(self, index):
        # ??? a generator cannot be indexed into
        pass

    def yield_from_pickle(self):
        with open(self.pfin, 'rb') as fhin:
            while True:
                try:
                    yield pickle.load(fhin)
                except EOFError:
                    break

But I am not at all sure whether this is even possible. How can I implement __len__ and __getitem__ in a sensible way? I don't think my __len__ is a good idea, because it has to consume an entire pass over the file just to count the samples, and I have no idea how to safely implement __getitem__ while retaining the generator-based reading.

Is there a better way? To summarize: I want to build a Dataset that can be fed to PyTorch's DataLoader (because of its multiprocessing abilities) in a memory-efficient way, without reading the whole file into memory.
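
For reference, the end goal is to consume the dataset like any other. Variable-length samples like mine would also need a custom collate_fn, but that is a separate issue; here collate_fn=list simply keeps the raw samples:

from torch.utils.data import DataLoader

loader = DataLoader(TextDataset('train.pkl'), batch_size=2,
                    num_workers=4, collate_fn=list)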

Bram Vanroy

1 Answer


See my other answer for your options.

In short, you need to either preprocess each sample into its own file, or use a data format that does not need to be loaded fully into memory for reading.
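
As an illustration of the second option: a pickle file written with repeated dump calls can itself serve as such a format if you scan it once to record the byte offset of each sample, then seek to that offset in __getitem__. A rough sketch under my own assumptions (the class name is made up, and the file must not change after indexing):

import pickle

from torch.utils.data import Dataset

class PickleOffsetDataset(Dataset):
    def __init__(self, pfin):
        self.pfin = pfin
        self.offsets = []
        # One pass over the file: remember where each pickled object starts.
        # This still deserializes every sample once, but holds only one
        # sample in memory at a time.
        with open(pfin, 'rb') as fhin:
            while True:
                offset = fhin.tell()
                try:
                    pickle.load(fhin)
                except EOFError:
                    break
                self.offsets.append(offset)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, index):
        # Open the file per item so the dataset works with multiple
        # DataLoader workers (no shared file handle between processes).
        with open(self.pfin, 'rb') as fhin:
            fhin.seek(self.offsets[index])
            return pickle.load(fhin)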

Coolness