
I used the following famous code for my sliding window through the tokenised text document:

def window(fseq, window_size):
    "Sliding window"
    it = iter(fseq)
    result = tuple(islice(it, 0, window_size, round(window_size/4)))
    if len(result) == window_size:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        result_list = list(result)
        yield result_list

When I call my function with a window size less than 6, everything is OK, but when I increase it, the beginning of the text is cut off.

For example:

c=['A','B','C','D','E', 'F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z']
print(list(window(c, 4)))
print(list(window(c, 8)))

Output:

[('A', 'B', 'C', 'D'), ['B', 'C', 'D', 'E'], ['C', 'D', 'E', 'F'], ['D', 'E', 'F', 'G'], ['E', 'F', 'G', 'H'], ['F', 'G', 'H', 'I'],...

[['C', 'E', 'G', 'I'], ['E', 'G', 'I', 'J'], ['G', 'I', 'J', 'K'], ['I', 'J', 'K', 'L'], ['J', 'K', 'L', 'M']...

What's wrong? And why is the first element of the first output in round brackets?

My expected output for print(list(window(c, 8))) is:

[['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'], ['C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'], ['E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']...
Xavier Guihot
Polly

1 Answer


Your version is incorrect. It passes a fourth argument (the step size) to islice(), which limits how many elements end up in the first slice:

result = tuple(islice(it, 0, window_size, round(window_size/4)))

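For 4 or 5, round(window_size/4) produces 1, the default step size. For larger values it produces a step size greater than 1, so islice() skips elements and the first window comes out shorter than window_size. The next test, len(result) == window_size, is then guaranteed to be false and that first window is never yielded. The elements islice() consumed while building it are gone from the iterator, which is why the beginning of your text is cut.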
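The effect of the step argument on the first slice can be checked in isolation. This is a minimal sketch, assuming Python 3 and the same 26 letters as the list c in the question (rebuilt here from string.ascii_uppercase); the names first and following are only for illustration:

```python
from itertools import islice
from string import ascii_uppercase

c = list(ascii_uppercase)  # same letters as the list c in the question

it = iter(c)
# window_size is 8, so the step is round(8 / 4) == 2
first = tuple(islice(it, 0, 8, 2))
following = next(it)

print(first)      # ('A', 'C', 'E', 'G'): 4 items, not 8
print(following)  # I: islice() already consumed 'A' through 'H'
```

The short tuple fails the length test, and the loop then starts from 'I', which matches the ['C', 'E', 'G', 'I'] first window in the question's output.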

Remove that step size argument, and you'll get your first window back again. Also see Rolling or sliding window iterator in Python.

The first result is in 'round brackets' because it is a tuple. If you wanted a list instead, use list() rather than tuple() in your code.
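Applying both suggestions (no step argument, list() instead of tuple()) to the function from the question could look like the sketch below. It assumes Python 3; the shortened sample input is only for illustration:

```python
from itertools import islice

def window(fseq, window_size):
    """Sliding window that advances one element at a time."""
    it = iter(fseq)
    # No step argument: take the first window_size elements as they come.
    result = list(islice(it, window_size))
    if len(result) == window_size:
        yield result
    for elem in it:
        # Drop the oldest element and append the newest (builds a new list).
        result = result[1:] + [elem]
        yield result

sample = list("ABCDEFGHIJ")  # shortened input, for illustration only
windows = list(window(sample, 8))
print(windows[0])    # ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
print(len(windows))  # 3
```

Every yielded window is now a list, so the output no longer mixes round and square brackets.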

If you want your window to slide along in steps larger than 1, you should not alter the initial window. Instead, remove step_size elements from the front of the window and add step_size new ones to the end as you iterate along. That's easier done with a while loop:

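from itertools import islice

def window_with_larger_step(fseq, window_size):
    """Sliding window

    The step size the window moves over increases with the size of the window.
    """
    it = iter(fseq)
    result = list(islice(it, 0, window_size))
    if len(result) == window_size:
        yield result
    step_size = max(1, int(round(window_size / 4)))  # no smaller than 1
    while True:
        new_elements = list(islice(it, step_size))
        if len(new_elements) < step_size:
            break  # not enough elements left for a full step
        # drop step_size elements from the front, append the new ones
        result = result[step_size:] + new_elements
        yield result

This adds step_size elements to the running result, removing step_size elements from the start to keep the window size constant.

Demo (Python 3, where round() rounds halves to the nearest even integer, so a window size of 10 also gives a step size of 2):

>>> print(list(window_with_larger_step(c, 6)))
[['A', 'B', 'C', 'D', 'E', 'F'], ['C', 'D', 'E', 'F', 'G', 'H'], ['E', 'F', 'G', 'H', 'I', 'J'], ['G', 'H', 'I', 'J', 'K', 'L'], ['I', 'J', 'K', 'L', 'M', 'N'], ['K', 'L', 'M', 'N', 'O', 'P'], ['M', 'N', 'O', 'P', 'Q', 'R'], ['O', 'P', 'Q', 'R', 'S', 'T'], ['Q', 'R', 'S', 'T', 'U', 'V'], ['S', 'T', 'U', 'V', 'W', 'X'], ['U', 'V', 'W', 'X', 'Y', 'Z']]
>>> print(list(window_with_larger_step(c, 8)))
[['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'], ['C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'], ['E', 'F', 'G', 'H', 'I', 'J', 'K', 'L'], ['G', 'H', 'I', 'J', 'K', 'L', 'M', 'N'], ['I', 'J', 'K', 'L', 'M', 'N', 'O', 'P'], ['K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R'], ['M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T'], ['O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V'], ['Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X'], ['S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']]
>>> print(list(window_with_larger_step(c, 10)))
[['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'], ['C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L'], ['E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N'], ['G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P'], ['I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R'], ['K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T'], ['M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V'], ['O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X'], ['Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']]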
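
The same idea also works with the step passed in explicitly rather than derived from the window size. The following is a minimal sketch, not a drop-in replacement: window_step and its step parameter are hypothetical names, and it assumes a collections.deque with maxlen is acceptable for holding the window, since a bounded deque discards its oldest items automatically when extended.

```python
from collections import deque
from itertools import islice

def window_step(fseq, window_size, step=1):
    """Yield lists of window_size items, advancing step items each time.

    window_step and step are hypothetical names used for this sketch.
    """
    if window_size < 1 or step < 1:
        raise ValueError("window_size and step must be at least 1")
    it = iter(fseq)
    # A bounded deque drops items from the left when it grows past maxlen.
    buf = deque(islice(it, window_size), maxlen=window_size)
    if len(buf) < window_size:
        return  # input is shorter than a single window
    yield list(buf)
    while True:
        new_elements = list(islice(it, step))
        if len(new_elements) < step:
            return  # not enough input left for a full step
        buf.extend(new_elements)
        yield list(buf)

stepped = list(window_step("ABCDEFGHIJ", 4, step=2))
print(stepped)
# [['A', 'B', 'C', 'D'], ['C', 'D', 'E', 'F'], ['E', 'F', 'G', 'H'], ['G', 'H', 'I', 'J']]
```

Called as window_step(c, 8, step=2), it starts with A to H, then C to J, then E to L, which is the shape of output the question asks for.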
Martijn Pieters