Use simple counting method for certain criteria

Question

I have a dataset like

For this dataset I want to go through the values and count the lengths of the runs of consecutive values above a cutoff, keeping only the runs whose length is above M.

The cutoff and M will be system arguments.

So if the cutoff is 0.32 and M is 1, it should print out a list like

[2, 4, 3, 2]

Logic: the first two values in the second column are above 0.32 and the length of that run is greater than M=1, hence it prints 2, then 4, 3, 2 and so on.

I need help writing the condition so that if x > cutoff and the length of a broken run is > M, it prints out the lengths of the broken frames (so the same output as above). Any help?

The structure should look like the following (I am not sure what to put in place of XXX)

import sys

def get_input(filename):
    with open(filename) as f:
        next(f) # skip the first line
        input_list = []
        for line in f:
            input_list.append(float(line.split()[1]))

    return input_list


def countwanted(input_list, wantbroken, cutoff,M):

    def whichwanted(x):
        if(wantbroken): return x > cutoff
        else: return x < cutoff

XXX I think here I need to add the criteria for M but not sure how?

filename=sys.argv[1]
wantbroken=(sys.argv[2]=='b' or sys.argv[2]=='B')
cutoff=float(sys.argv[3])
M=int(sys.argv[4])

input_list = get_input(filename)

broken,lifebroken=countwanted(input_list,True,cutoff,M)
#closed,lifeclosed=countwanted(input_list,False,cutoff,M)
print(lifebroken)
#print(lifeclosed)

Or maybe there is a simpler way to write it.

There's an awful amount of code. Could you show an expected output given the above input (and any additional input required)? — CristiFati, Aug 24 '18 at 18:38
@CristiFati [2, 4, 3, 2] if executed like python myscript.py test.dat b 0.32 1 where 0.32 is cutoff and M=1. this is the expected output — Roy Banerjee, Aug 24 '18 at 18:44

Mad Physicist · Accepted Answer · 2018-08-24T21:32:43.573

If you are OK with using numpy, it makes life a lot easier.

First off, let's take a look at the file loader. np.loadtxt can do the same thing in one line.

import numpy as np
y = np.loadtxt(filename, skiprows=1, usecols=1)  # skip the header, read the second column

Now create a mask of the values that make up your above-threshold runs:

b = (y > cutoff)  # I think you can figure out how to switch the sense of the test

The rest is easy, and based off this question:

b = np.r_[0, b, 0]       # pad the ends
d = np.diff(b)           # find changes in state
start, = np.where(d > 0) # convert switch up to start indices
end, = np.where(d < 0)   # convert switch down to end indices
lengths = end - start    # get the lengths (named to avoid shadowing the built-in len)

Now you can apply M to lengths:

result = lengths[lengths >= M]
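For reference, the numpy steps above can be collected into one function. This is only a minimal sketch: the function name `run_lengths`, the `want_above` flag (standing in for the question's `wantbroken` switch) and the sample values are made up for illustration, since the original dataset is not shown. It keeps runs with length >= M, as in the snippet above; change the comparison to > if runs must be strictly longer than M.

```python
import numpy as np

def run_lengths(y, cutoff, M, want_above=True):
    """Lengths of runs of consecutive values above (or below) cutoff, keeping lengths >= M."""
    y = np.asarray(y, dtype=float)
    # Boolean mask of the values that belong to a run
    mask = (y > cutoff) if want_above else (y < cutoff)
    # Pad with zeros so runs touching either end still get a start and an end
    padded = np.r_[0, mask.astype(int), 0]
    d = np.diff(padded)
    start, = np.where(d > 0)   # 0 -> 1 transitions
    end, = np.where(d < 0)     # 1 -> 0 transitions
    lengths = end - start
    return lengths[lengths >= M]

# Hypothetical data, not the dataset from the question
sample = [0.5, 0.4, 0.1, 0.6, 0.7, 0.8, 0.9, 0.2,
          0.35, 0.4, 0.5, 0.1, 0.33, 0.9, 0.1, 0.5, 0.1]

print(run_lengths(sample, 0.32, 2).tolist())  # [2, 4, 3, 2]
print(run_lengths(sample, 0.32, 1).tolist())  # [2, 4, 3, 2, 1]
```

Calling `.tolist()` on the result also prints every element, whereas printing a long numpy array directly abbreviates it with an ellipsis.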

If you want to work with lists, itertools.groupby also offers a good solution:

import itertools as it
grouper = it.groupby(y, key=lambda x: x > cutoff)
result = [x for x in (len(list(group)) for key, group in grouper if key) if x >= M]
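A pure-Python version of the same idea, written out as a loop, might look like the sketch below. The function name `run_lengths_groupby` and the sample values are hypothetical, and the `wantbroken` parameter mirrors the flag from the question: when it is False the test is flipped to x < cutoff.

```python
import itertools as it

def run_lengths_groupby(values, cutoff, M, wantbroken=True):
    """Lengths of consecutive runs that satisfy the cutoff test, keeping lengths >= M."""
    def wanted(x):
        # Same sense switch as whichwanted() in the question
        return x > cutoff if wantbroken else x < cutoff

    result = []
    # groupby yields one group per stretch of consecutive items with the same key
    for key, group in it.groupby(values, key=wanted):
        if not key:
            continue  # this stretch does not satisfy the test
        n = sum(1 for _ in group)  # length of the run
        if n >= M:
            result.append(n)
    return result

# Hypothetical data, not the dataset from the question
sample = [0.5, 0.4, 0.1, 0.6, 0.7, 0.8, 0.9, 0.2,
          0.35, 0.4, 0.5, 0.1, 0.33, 0.9, 0.1, 0.5, 0.1]

print(run_lengths_groupby(sample, 0.32, 2))  # [2, 4, 3, 2]
```

Because this returns a plain list, it works with the output of the original get_input() and prints in full without any numpy display settings.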

edited Aug 24 '18 at 21:32

answered Aug 24 '18 at 19:40


Well I put your command in a script `#!/usr/bin/python import numpy as np import sys y = np.loadtxt(filename, skiprows=1, usecols=1) b = (y > cutoff) b = np.r_[0, b, 0] d = np.diff(b) start, = np.where(d > 0) end, = np.where(d < 0) len = end - start result = len[len >= M] filename=sys.argv[1] cutoff=int(sys.argv[1]) M=int(sys.argv[2]) print(result)` and trying to run it like python broken_nogroupby.py test.dat 0.33 1 but it is saying NameError: name 'filename' is not defined Maybe I am doing a silly mistake. any help? – Roy Banerjee Aug 24 '18 at 20:26
am I inputting the filename, cutoff wrongly in the system argument? – Roy Banerjee Aug 24 '18 at 20:31
Well, you have to set filename somewhere. Keep your original argument processing. – Mad Physicist Aug 24 '18 at 21:16
Silly question when I am using print(result) it is giving [37 34 56 ..., 30 43 12] So it is doing the right thing but not printing out the full list but rather giving .... Any tricks to make it print the full list – Roy Banerjee Aug 25 '18 at 08:17
Okay I fixed it `def fullstring(k): return " ".join([str(x) for x in k]) pd.options.display.max_seq_items = 10000 print(fullstring(result))` – Roy Banerjee Aug 25 '18 at 08:21
