I'm trying to implement a sliding (moving) window over the lines of a CSV file in Python. Each line has a column holding a binary value, yes or no. Basically, I want to remove rare, noisy yes values. That is, if there are, say, 3 yes lines in a window of 5 (max of 5), keep them; but if there are only 1 or 2, change them to no. How can I do that?

For instance, both yes values below should become no:

...
1,a1,b1,no,0.75
2,a2,b2,no,0.45
3,a3,b3,yes,0.98
4,a4,b4,yes,0.22
5,a5,b5,no,0.46
6,a6,b6,no,0.20
...

But in the following, we keep them as is (there is a window of 5 in which 3 of the lines are yes):

...
1,a1,b1,no,0.75
2,a2,b2,no,0.45
3,a3,b3,yes,0.98
4,a4,b4,yes,0.22
5,a5,b5,no,0.46
6,a6,b6,yes,0.20
...

I attempted writing something, having a window of 5, but got stuck (it is not complete):

        window_size = 5 
        filename='C:\\Users\\username\\v3\\And-'+v3file.split("\\")[5]
        with open(filename) as fin:
            with open('C:\\Users\\username\\v4\\And2-'+v3file.split("\\")[5],'w') as finalout:
                line= fin.readline()
                index = 0
                sequence= []
                accs=[]
                while line:
                    print(line)
                    for i in range(window_size):
                        line = fin.readline()
                        sequence.append(line)
                    index = index + 1
                    fin.seek(index)
Tina J
  • Are you trying to solve this by keeping the most recent three rows in a variable/window? – wwii Dec 17 '19 at 19:17
  • @wwii Actually, let's say the max out of a window of 5 (the 3 yes lines don't necessarily need to be consecutive). Updated the question a bit. – Tina J Dec 17 '19 at 19:22
  • Is the file very large? Is it important to read one line of the file at a time? If you read the entire file into memory, your problem becomes easier and the code becomes cleaner, and you don't have to do things like `fin.seek` – vasia Dec 17 '19 at 19:22
  • Can you provide a more complete sample, and what the subsequent output should look like? – PMende Dec 17 '19 at 19:22
  • @vasia The file can be up to 10MB. But if you think it fits in memory, then fine. – Tina J Dec 17 '19 at 19:24
  • @PMende added another example. I think it is clear enough now. – Tina J Dec 17 '19 at 19:27

2 Answers

You can use collections.deque with the maxlen argument set to the desired window size to implement a sliding window that tracks the yes/no flags of the most recent 5 rows. To be more efficient, keep a running count of yeses instead of summing the window in every iteration. Whenever the window is full and the count of yeses is greater than 2, add the line indices of those yeses to a set of lines whose yes should be kept as-is. Then, in a second pass after resetting the input file pointer, change yes to no on every line whose index is not in the set:

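from collections import deque

# filename and output_filename are the input and output paths
window_size = 5
with open(filename) as fin, open(output_filename, 'w') as finalout:
    yeses = 0                           # running count of yes flags in the window
    window = deque(maxlen=window_size)  # flags of the most recent lines
    preserved = set()                   # indices of lines whose yes is kept
    # first pass: find the yes lines that share a full window with enough yeses
    for index, line in enumerate(fin):
        window.append('yes' in line)
        if window[-1]:
            yeses += 1
        if len(window) == window_size:
            if yeses > 2:
                preserved.update(i for i, f in enumerate(window, index - window_size + 1) if f)
            # the oldest flag is dropped by the next append, so uncount it now
            if window[0]:
                yeses -= 1
    # second pass: rewrite every line that is not preserved
    fin.seek(0)
    for index, line in enumerate(fin):
        if index not in preserved:
            line = line.replace('yes', 'no')
        finalout.write(line)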

Demo: https://repl.it/@blhsing/StripedCleanCopyrightinfringement
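
For files that fit in memory (the question mentions up to 10MB), the same two-pass idea can be sketched as a function over a list of lines. The name suppress_sparse_yes and the min_yes parameter are not part of the answer above; they are assumptions introduced here so that the window size and the threshold can be changed in one place. Like the original, this sketch matches the substring yes anywhere in the line.

```python
from collections import deque

def suppress_sparse_yes(lines, window_size=5, min_yes=3):
    """Return a copy of lines in which 'yes' becomes 'no' unless the line
    lies in some full window of window_size lines with >= min_yes yeses."""
    window = deque(maxlen=window_size)  # flags of the most recent lines
    yeses = 0                           # running count of True flags in window
    preserved = set()                   # indices whose 'yes' is kept
    for index, line in enumerate(lines):
        # the oldest flag is evicted by the append below, so uncount it first
        if len(window) == window_size and window[0]:
            yeses -= 1
        window.append('yes' in line)
        yeses += window[-1]
        if len(window) == window_size and yeses >= min_yes:
            start = index - window_size + 1
            preserved.update(i for i, f in enumerate(window, start) if f)
    return [line if i in preserved else line.replace('yes', 'no')
            for i, line in enumerate(lines)]

# the two samples from the question
sparse = ['1,a1,b1,no,0.75', '2,a2,b2,no,0.45', '3,a3,b3,yes,0.98',
          '4,a4,b4,yes,0.22', '5,a5,b5,no,0.46', '6,a6,b6,no,0.20']
dense = sparse[:5] + ['6,a6,b6,yes,0.20']
print(suppress_sparse_yes(sparse))  # both yes values become no
print(suppress_sparse_yes(dense))   # unchanged
```

For a window of 10, only the window_size argument (and, if needed, min_yes) has to change.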

blhsing
  • Thanks. Any chance **not** to use `csv`? Let's just generalize to a text file where `yes` exists in a line. I didn't add csv in my title to make it general. – Tina J Dec 17 '19 at 21:38
  • Edited accordingly then. – blhsing Dec 17 '19 at 21:42
  • Trying now...Hopefully no issues. – Tina J Dec 17 '19 at 21:43
  • @TinaJ Since it's actually a CSV, I don't think telling people to generalize is going to help give you the correct solution. It's pretty trivial to read the CSV correctly and doesn't add any complexity (if anything, it reduces complexity). – ggorlen Dec 17 '19 at 21:54
  • @ggorlen I see. Thanks. @blhsing Now if I want to get max of 10 rows (window size of 10) what numbers should I exactly change? I see many numbers inside. Can you make `window_size=5` a variable and adapt other numbers accordingly? – Tina J Dec 17 '19 at 21:56
  • Edited accordingly then. – blhsing Dec 17 '19 at 21:57
  • Great, it works. And `yeses > 2` was supposed to be the max count of the whole window, but I already know how to edit that. So thanks! – Tina J Dec 17 '19 at 22:00
  • please take a look at my new question: https://stackoverflow.com/questions/59402149 – Tina J Dec 19 '19 at 02:10

Here is a 5-liner solution based on building successive list comprehensions:

lines = [
'1,a1,b1,no,0.75',
'2,a2,b2,yes,0.45',
'3,a3,b3,yes,0.98',
'4,a4,b4,yes,0.22',
'5,a5,b5,no,0.46',
'6,a6,b6,no,0.98',
'7,a7,b7,yes,0.22',
'8,a8,b8,no,0.46',
'9,a9,b9,no,0.20']

n = len(lines)

# flag all lines containing 'yes' (pad with 2 empty lines at each boundary to avoid edge problems)
flags = [line.count('yes') for line in ['', '']+lines+['', '']]
# count the flags in the sliding window [p-2, p+2] centered on each line
counts = [sum(flags[p-2:p+3]) for p in range(2,n+2)]
# tag lines that need to be changed
tags = [flag > 0 and count < 3 for (flag,count) in zip(flags[2:],counts)]
# change tagged lines (use i so the loop does not overwrite n)
for i in range(n):
  if tags[i]: lines[i] = lines[i].replace('yes','no')

print(lines)

Result:

['1,a1,b1,no,0.75',
 '2,a2,b2,yes,0.45',
 '3,a3,b3,yes,0.98',
 '4,a4,b4,yes,0.22',
 '5,a5,b5,no,0.46',
 '6,a6,b6,no,0.98',
 '7,a7,b7,no,0.22',
 '8,a8,b8,no,0.46',
 '9,a9,b9,no,0.20']
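
The window in this approach is centered on each line, so a yes is judged only by the two lines on either side of it. A minimal sketch of the same idea with the sizes pulled out as parameters is given below; the function name centered_denoise and the parameters half_width and min_yes are assumptions made for illustration, not part of the answer. The last call shows that a centered window can be stricter than "any window of 5 containing the line": with yes, no, yes, no, yes, only the middle yes survives.

```python
def centered_denoise(lines, half_width=2, min_yes=3):
    """Change 'yes' to 'no' on lines whose centered window
    [p - half_width, p + half_width] holds fewer than min_yes yes lines."""
    pad = [0] * half_width  # padding so that edge windows stay in range
    flags = pad + [int('yes' in line) for line in lines] + pad
    out = []
    for p, line in enumerate(lines, half_width):
        count = sum(flags[p - half_width:p + half_width + 1])
        if flags[p] and count < min_yes:
            line = line.replace('yes', 'no')
        out.append(line)
    return out

sample = ['1,a1,b1,no,0.75', '2,a2,b2,yes,0.45', '3,a3,b3,yes,0.98',
          '4,a4,b4,yes,0.22', '5,a5,b5,no,0.46', '6,a6,b6,no,0.98',
          '7,a7,b7,yes,0.22', '8,a8,b8,no,0.46', '9,a9,b9,no,0.20']
print(centered_denoise(sample))                             # only line 7 changes
print(centered_denoise(['yes', 'no', 'yes', 'no', 'yes']))  # edge yeses change
```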

EDIT: Since you read your data from a standard text file, all you have to do is:

with open(filename,'r') as f:
  lines = f.read().strip().split('\n')

(strip removes potential blank lines at the top or bottom of the file, and split('\n') turns the file content into a list of lines.) Then use the code above.
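
Since the sample data is comma-separated, the substring test can also be replaced by a check on the flag column alone, so that a yes appearing inside another field is left untouched. The following is a sketch under assumptions: the function name denoise_csv is hypothetical, the flag is taken to sit in the fourth column (index 3) as in the question's sample, and the rule applied is "keep a yes if some window of window_size consecutive rows containing it has at least min_yes yes rows".

```python
import csv
import io

def denoise_csv(text, column=3, window_size=5, min_yes=3):
    """Return CSV text with isolated 'yes' values in `column` set to 'no'."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]  # skip blank lines
    flags = [row[column] == 'yes' for row in rows]
    keep = set()
    # examine every full window and remember the yes rows of the dense ones
    for start in range(len(rows) - window_size + 1):
        hits = [i for i in range(start, start + window_size) if flags[i]]
        if len(hits) >= min_yes:
            keep.update(hits)
    for i, row in enumerate(rows):
        if flags[i] and i not in keep:
            row[column] = 'no'
    out = io.StringIO()
    csv.writer(out, lineterminator='\n').writerows(rows)
    return out.getvalue()

text = '1,yes-man,b1,no,0.75\n2,a2,b2,yes,0.45\n'
print(denoise_csv(text))  # the flag on row 2 changes, 'yes-man' does not
```

This recounts each window from scratch, which is simpler than a running count and should be fast enough for files of the size mentioned in the question.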

sciroccorics