
I want to create a class that will be able to apply transformations, including shuffle, to a dataset. The prototype I came up with looks something like this:

import numpy as np

class Dataset:
  def __init__(self, src, tgt):
    self.src = src
    self.tgt = tgt
    self.gen = ((s, t) for (s, t) in zip(self.src, self.tgt))
    
  def __iter__(self):
    return self.gen
  
  def __next__(self):
    for pt in self.gen:
      return pt
      
  def shuffle(self):
    self.gen = (pt for pt in np.random.shuffle([pt for pt in zip(self.src, self.tgt)]))

The generator self.gen is created successfully, but I get an error when I call the .shuffle() method:

self.gen = (pt for pt in np.random.shuffle([pt for pt in zip(self.src, self.tgt)]))
TypeError: 'NoneType' object is not iterable

I understand that the generator is not created, but I do not understand why. I would appreciate some help and an explanation of why my attempt failed.

Aramakus
  • `np.random.shuffle()` doesn't return anything. You need to iterate over the array that you passed to it. – jasonharper Jul 20 '20 at 13:26
  • Does this answer your question? [Shuffling a list of objects](https://stackoverflow.com/questions/976882/shuffling-a-list-of-objects) – Matteo Peluso Jul 20 '20 at 13:28
  • Indeed, thanks! Would `[pt for pt in zip(self.src, self.tgt)]` create a copy of the data, and if it does, do you know a good way to avoid that? – Aramakus Jul 20 '20 at 13:30
  • Your iterator implementation is broken. Your `Dataset` object shouldn't be an iterator to begin with. It should be *iterable*, i.e. it should only define an `__iter__` method, which returns a new iterator. Iterators must return `self` from `__iter__`, and they should raise `StopIteration` when they are exhausted. – juanpa.arrivillaga Jul 20 '20 at 13:31
  • In this case, you should just implement `__iter__` and make it a generator function that first shuffles your data, then iterates over the shuffled data and yields each item individually. – juanpa.arrivillaga Jul 20 '20 at 13:32
  • @Aramakus it creates a new list, yes. You were already doing that anyway. – juanpa.arrivillaga Jul 20 '20 at 13:32
  • @juanpa.arrivillaga, thank you!! Makes excellent sense, I have not touched `__iter__` before and did not know how it works. – Aramakus Jul 20 '20 at 13:33
  • Rather than creating one single generator in `__init__` (which will be exhausted after the first use), create it in `__iter__` to get a fresh iterator based on `src` and `tgt` each time. – chepner Jul 20 '20 at 13:35
  • @chepner, I want to be able to create different generators by applying several possible transformations to the data; `shuffle` is only one of them. I do not understand how I can do that inside `__iter__` while keeping the ability to pick which transformation I want to use. – Aramakus Jul 20 '20 at 13:39
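
The suggestions in the comments above can be sketched roughly as follows. This is only an illustration, not code posted by any commenter: the class name `ShuffledPairs` is made up, and the standard library's `random.shuffle` is used in place of `np.random.shuffle` (both shuffle in place and return `None`). The object is an iterable rather than an iterator, so every pass starts from a fresh generator.

```python
import random


class ShuffledPairs:
    """Iterable (not an iterator): each call to __iter__ starts a new pass.

    Hypothetical name, used only for this illustration.
    """

    def __init__(self, src, tgt):
        self.src = src
        self.tgt = tgt

    def __iter__(self):
        # shuffle() works in place and returns None, so keep a reference
        # to the list and iterate over the list itself afterwards.
        pairs = list(zip(self.src, self.tgt))
        random.shuffle(pairs)
        yield from pairs


data = ShuffledPairs([1, 2, 3], ["a", "b", "c"])
first_pass = list(data)   # one shuffled pass over all pairs
second_pass = list(data)  # a second pass works, independently shuffled
```

Because `__iter__` is a generator function, `StopIteration` is raised automatically when the pairs run out, and no `__next__` method is needed on the class.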

1 Answer


You don't need Dataset.gen. In fact, it only makes things more complicated, because a generator is exhausted after a single pass and still has to hold on to the original value being iterated over. You also don't need numpy as a dependency: the standard library's random.shuffle is enough here.

import random

class Dataset:
    def __init__(self, *sources):
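        if len(sources) == 1:
            # a single argument is taken as the rows themselves
            self.sources, = sources
        else:
            # several arguments are columns: zip(*sources) pairs them up row by row,
            # and list() keeps the rows so they can be iterated more than once
            self.sources = list(zip(*sources))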
    
    def __iter__(self):
        # default iterator
        return iter(self.sources)

    def shuffled(self):
        # returns a **new** Dataset, shuffled from the original
        c = list(self)
        random.shuffle(c)
        return Dataset(c)

This generalizes Dataset so that it accepts any number of arguments, each argument specifying a "column" in the dataset. I don't know if this is the sort of thing you want. If not, please edit your question to be more precise about how your code is going to be used (preferably with some examples).
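
As a rough usage sketch of this design, the block below restates the class in self-contained form and adds a second transformation, to show how each transformation can be a method that returns a new Dataset. Several things here are assumptions for illustration and not part of the answer's code: the attribute name `rows`, the `seed` parameter, and the `mapped` method.

```python
import random


class Dataset:
    def __init__(self, *sources):
        if len(sources) == 1:
            # one argument: treat it as the rows, copied into a list
            self.rows = list(sources[0])
        else:
            # several arguments: treat them as columns and pair them up
            self.rows = list(zip(*sources))

    def __iter__(self):
        # a fresh iterator on every call, so the data can be reused
        return iter(self.rows)

    def shuffled(self, seed=None):
        # returns a new Dataset; the original order is left untouched
        # (seed is a hypothetical convenience for reproducible runs)
        rows = list(self)
        random.Random(seed).shuffle(rows)
        return Dataset(rows)

    def mapped(self, fn):
        # hypothetical extra transformation: apply fn to every row
        return Dataset([fn(row) for row in self])


ds = Dataset([1, 2, 3], ["a", "b", "c"])
shuffled = ds.shuffled(seed=0)
upper = ds.mapped(lambda row: (row[0], row[1].upper()))
chained = ds.mapped(lambda row: row[0] * 10).shuffled(seed=1)
```

Since every transformation returns a new Dataset, the caller picks which ones to apply simply by calling them, and they can be chained in any order without touching the original data.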

Jasmijn