
I have a list of words (tokens) that I iterate over. I want to apply a transformation to moving windows of that list. The window size can vary.

for i in range(0, len(tokens) - window_size + 1, step):  # +1 so the last full window is included
    doc2vec.model.infer_vector(tokens[i:i+window_size])

The for loop walks through the tokens in increments of step and takes window_size tokens each time. The problem is the last iteration. The loop ends at the length of tokens minus the window size (plus 1, so that the final full window is included). Say the window size is 10, the step is 5, and there are 98 tokens. My code would then process its last window at 85:95 and leave out the last three elements. I want a solution that works for any window_size, step, and number of tokens. To illustrate: the current code works fine if there are 95 tokens, but with 98 the last three elements are skipped. I would want them processed together as one extra window, 88:98.
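
The numbers above can be reproduced with a small sketch. The helper name window_bounds is hypothetical, and it assumes the range's upper bound is len(tokens) - window_size + 1 so that the last full window is included:

```python
def window_bounds(n_tokens, window_size, step):
    """Return (start, end) slice bounds for full windows taken at a fixed step."""
    return [(start, start + window_size)
            for start in range(0, n_tokens - window_size + 1, step)]


bounds = window_bounds(98, 10, 5)
print(bounds[-1])           # (85, 95): the last full window
print(98 - bounds[-1][1])   # 3: tokens never covered by any window
```

With 95 tokens the last window is also (85, 95) and nothing is left over, which is why the gap only shows up for some lengths.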

Borut Flis
  • But should the last window overlap the previous one by an amount different from the step? In your example the last batch is 85:95; do you want an additional 88:98 batch that overrides the current step? – tatarana Oct 06 '20 at 11:13
  • Yes, I want the window 85:95 processed and then the window 88:98. – Borut Flis Oct 06 '20 at 18:58

1 Answer

I think the way to go is creating your own custom iterator:

class MovingWindow:
    def __init__(self, tokens, window_size, step):
        self.current = -step
        self.last = len(tokens) - window_size + 1
        self.remaining = (len(tokens) - window_size) % step
        self.tokens = tokens
        self.window_size = window_size
        self.step = step

    def __iter__(self):
        return self

    def __next__(self):
        self.current += self.step
        if self.current < self.last:
            return self.tokens[self.current : self.current + self.window_size]
        elif self.remaining:
            self.remaining = 0
            return self.tokens[-self.window_size:]
        else:
            raise StopIteration

which you can then use like this:

for t in MovingWindow(tokens, 10, 5):
    doc2vec.model.infer_vector(t)

You could also modify the iterator so that it returns the indexes instead of the tokens. Another option is to write a simple generator; more information here
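
A minimal sketch of that generator option, not a drop-in replacement for the class above. The name moving_windows is hypothetical, and it assumes that step is at least 1 and that a list shorter than one window should yield nothing:

```python
def moving_windows(tokens, window_size, step):
    """Yield fixed-step windows, then one extra window aligned to the end
    of the list if the fixed-step windows stopped short of the last token."""
    n = len(tokens)
    if n < window_size:
        # Assumption: input shorter than one window produces no windows.
        return
    start = 0
    for start in range(0, n - window_size + 1, step):
        yield tokens[start:start + window_size]
    # `start` now holds the last fixed-step start index.
    if start + window_size < n:
        yield tokens[n - window_size:]


windows = list(moving_windows(list(range(98)), 10, 5))
print(len(windows))                     # 19
print(windows[-1][0], windows[-1][-1])  # 88 97
```

The generator needs no remainder bookkeeping: it only checks whether the last regular window already reached the end of the list.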

To illustrate with the example you provided:

indexes = list(range(98))
for i in MovingWindow(indexes, 10, 5):
    print(f'{i[0]}:{i[-1]}')

Output:

0:9
5:14
10:19
15:24
20:29
25:34
30:39
35:44
40:49
45:54
50:59
55:64
60:69
65:74
70:79
75:84
80:89
85:94
88:97
tatarana
  • Thank you. I think self.remaining = (len(tokens) + step) % window_size is a confusing way to calculate the leftover words. len(tokens) - window_size gives the actual number; however, your way does not lead to a faulty result. – Borut Flis Oct 11 '20 at 12:48
  • Hi Borut. I guess you meant "len(tokens) % window_size", right? I tried this at first, but it leads to an error when len(tokens) = 95. It takes the remainder of 95 / 10, which is 5, so you get a duplicate list at the end even though the step fits perfectly. I've rerun my tests and I had actually made a mistake; the correct way is "(len(tokens) + window_size) % step". Please let me know if you find a better way to simplify it, and thanks for the response. – tatarana Oct 11 '20 at 13:55
  • Yes, (len(tokens)+window_size) % step is the correct one or (len(tokens)+window_size) % step it is the same thing. What type of tests did you use? – Borut Flis Oct 11 '20 at 15:27
  • I mean (len(tokens)-window_size) % step is the same. – Borut Flis Oct 11 '20 at 15:32
  • Actually, (len(tokens) - window_size) % step appears correct and (len(tokens) + window_size) % step is not. I found a counterexample: length 82, window 7, step 5. With the first formula you get 0 remaining, and with the second you get 4. – Borut Flis Oct 11 '20 at 16:08
  • Yes, you are totally correct! I changed it in the answer. Thanks again! – tatarana Oct 12 '20 at 00:51
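
The two remainder formulas discussed in these comments can be compared by brute force. This is only a sketch of one possible test, not the test either commenter ran; the helper names tail_is_uncovered and find_mismatches and the parameter ranges are assumptions chosen for illustration:

```python
from itertools import product


def tail_is_uncovered(n_tokens, window_size, step):
    """Ground truth by enumeration: do fixed-step windows stop short of the end?"""
    starts = range(0, n_tokens - window_size + 1, step)
    return starts[-1] + window_size < n_tokens


def find_mismatches(formula):
    """Return every (length, window, step) where a nonzero formula value
    disagrees with the enumerated ground truth."""
    cases = product(range(10, 120), range(1, 11), range(1, 11))
    return [(n, w, s) for n, w, s in cases
            if bool(formula(n, w, s)) != tail_is_uncovered(n, w, s)]


minus = find_mismatches(lambda n, w, s: (n - w) % s)
plus = find_mismatches(lambda n, w, s: (n + w) % s)
print(len(minus))           # 0: the subtraction formula always agrees
print((82, 7, 5) in plus)   # True: the counterexample from the comments
```

Every tested length is at least as large as the window, so the enumeration always has at least one start index.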