I'm from biology and very new to Python and ML. Our lab has a black-box ML model which outputs a sequence like this:

Predictions =
[1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,1,0,1,0,1,0,1,1,1,1,1,0,0,0,1,1,1,1,1,1,0]  

Each value represents a predicted time frame of 0.25 seconds.
1 means High.
0 means Not-High.

How do I convert these predictions into [start, stop, label] entries, so that consecutive identical values are grouped? For example, the first 10 ones span 0 to 10 * 0.25 s, so the first range and label would be

[0.0, 2.5, High]

Next there are 13 zeroes, so start = 2.5, stop = 13 * 0.25 + 2.5, label = Not-High, thus

[2.5, 5.75, Not-High]

So the final result would be a list of lists: unique, non-overlapping intervals, each with a label, like:

[[0.0,2.5, High],
[2.5, 5.75, Not-High],
[5.75,6.50, High] ..

What I tried:
1. Count the number of values in Predictions
2. Generate two ranges, one starting at zero and another starting at 0.25
3. Merge these two ranges into tuples

import numpy as np  
len_pred = len(Predictions) 
range_1 = np.arange(0,len_pred,0.25)
range_2 = np.arange(0.25,len_pred,0.25)
new_range = zip(range_1,range_2)  

This gives me the ranges, but I'm missing the labels.
It seems like a simple problem, but I'm running in circles.

Please advise. Thanks.

Seirra
  • How attached are you to that output style? From `numpy` you could very easily (and comparatively quickly) get `array([0.0, 2.5, 5.75, 6.5 . . . .])` as the transition times, and the labels can be generated as just an alternating sequence of `['High', 'Not-High', . . . ]`. But if you want a list of lists while mixing floats with strings you'll pretty much be stuck with base `python` methods (and slow `for` loops). – Daniel F Feb 21 '18 at 06:51
  • For instance, see @Divakar's answer [here](https://stackoverflow.com/questions/47750593/finding-false-true-transitions-in-a-numpy-array) or DilithiumMatrix's answer [here](https://stackoverflow.com/questions/36894822/how-do-i-identify-sequences-of-values-in-a-boolean-array) – Daniel F Feb 21 '18 at 06:57
  • Two examples below at time of writing give you the answer in the format asked for. If you are interested in doing it yourself, then half the problem is knowing what to look for. The manipulation you want is basically run-length encoding (https://en.wikipedia.org/wiki/Run-length_encoding), followed by a particular formatting of the runs. – Paddy3118 Feb 21 '18 at 08:43
  • Let me try the suggestions above, thank you. @DanielF I'm open to the best possible solution; given my limited exposure to Python, [float, float, string] was the only kind of output I could think of. It is fine even if I have the ranges on one side and the corresponding labels separately. I just read that I can use `hstack` from numpy to combine the ranges and the labels. @Paddy3118 thank you for your insights, let me go through the links you provided. – Seirra Feb 21 '18 at 15:39
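
The run-length encoding idea from the comments can be sketched with the standard library's `itertools.groupby`, which groups consecutive equal values. This is only an illustration: the function name `runs_to_intervals`, the constant names, and the shortened input are assumptions made here, not part of the original thread.

```python
from itertools import groupby

FRAME_SECONDS = 0.25  # duration of one prediction frame, as stated in the question
LABELS = {1: "High", 0: "Not-High"}  # label text taken from the question


def runs_to_intervals(predictions, frame_seconds=FRAME_SECONDS, labels=LABELS):
    """Run-length encode predictions, then format each run as [start, stop, label]."""
    intervals = []
    start_frame = 0
    for value, group in groupby(predictions):
        # groupby yields one group per run of consecutive equal values
        run_length = sum(1 for _ in group)
        stop_frame = start_frame + run_length
        intervals.append(
            [start_frame * frame_seconds, stop_frame * frame_seconds, labels[value]]
        )
        start_frame = stop_frame
    return intervals


# Shortened input (hypothetical): the first three runs of the question's sequence
predictions = [1] * 10 + [0] * 13 + [1] * 3
print(runs_to_intervals(predictions))
# [[0.0, 2.5, 'High'], [2.5, 5.75, 'Not-High'], [5.75, 6.5, 'High']]
```

Start and stop times are computed from frame counts, so each interval begins exactly where the previous one ends.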

3 Answers

You can iterate through the list and create a range when you detect a change. You'll also need to account for the final range when using this method. Might not be super clean but should be effective.

current_time = 0
range_start = 0
current_value = predictions[0]
ranges = []
for p in predictions:
  if p != current_value:
    ranges.append([range_start, current_time, 'high' if current_value == 1 else 'not high'])
    range_start = current_time
    current_value = p
  current_time += .25
ranges.append([range_start, current_time, 'high' if current_value == 1 else 'not high'])

Updated to fix a few off-by-one errors.
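
The loop above advances `current_time` by repeated addition. That is exact for 0.25, but it can accumulate floating-point rounding error for frame durations such as 0.1 that have no exact binary representation. A minimal sketch of the same change-detection idea with times derived from frame indices follows; the function name `to_ranges` and the parameter `frame_seconds` are hypothetical, and the input is assumed to be a plain Python list.

```python
def to_ranges(predictions, frame_seconds=0.25):
    """Group consecutive equal predictions into [start, stop, label] ranges."""
    if len(predictions) == 0:
        return []
    ranges = []
    start_index = 0
    for index in range(1, len(predictions) + 1):
        # Close the current run at the end of the list or when the value changes.
        at_end = index == len(predictions)
        if at_end or predictions[index] != predictions[start_index]:
            label = 'high' if predictions[start_index] == 1 else 'not high'
            # Times come from multiplication, not from a running sum.
            ranges.append([start_index * frame_seconds, index * frame_seconds, label])
            start_index = index
    return ranges


print(to_ranges([1, 1, 0]))
# [[0.0, 0.5, 'high'], [0.5, 0.75, 'not high']]
```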

Steve
  • Could you guide me here: https://stackoverflow.com/questions/49176702/conditionally-replace-values-in-one-list-using-another-list-of-different-length – Seirra Mar 08 '18 at 19:19

Using diff() and where(), you can find every index at which the value changes:

import numpy as np

p = np.array([1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,1,0,1,0,1,0,1,1,1,1,1,0,0,0,1,1,1,1,1,1,0])

idx = np.r_[0, np.where(np.diff(p) != 0)[0]+1, len(p)]
t = idx * 0.25

np.c_[t[:-1], t[1:], p[idx[:-1]]]

output:

array([[  0.  ,   2.5 ,   1.  ],
       [  2.5 ,   5.75,   0.  ],
       [  5.75,   6.5 ,   1.  ],
       [  6.5 ,   6.75,   0.  ],
       [  6.75,   7.  ,   1.  ],
       [  7.  ,   7.25,   0.  ],
       [  7.25,   7.5 ,   1.  ],
       [  7.5 ,   7.75,   0.  ],
       [  7.75,   8.  ,   1.  ],
       [  8.  ,   8.25,   0.  ],
       [  8.25,   9.5 ,   1.  ],
       [  9.5 ,  10.25,   0.  ],
       [ 10.25,  11.75,   1.  ],
       [ 11.75,  12.  ,   0.  ]])
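
If string labels in a plain list of lists are wanted, as in the question, the numeric third column can be mapped afterwards. The following is a sketch under assumptions: it repeats the `idx` and `t` computation from above on a shortened, hypothetical input so that it runs on its own, and it assumes a non-empty array.

```python
import numpy as np

# Shortened, hypothetical input; the question's full array works the same way
p = np.array([1, 1, 1, 0, 0, 1])

# Run boundaries: 0, every index where the value changes, and len(p)
idx = np.r_[0, np.where(np.diff(p) != 0)[0] + 1, len(p)]
t = idx * 0.25

# Map the value at the start of each run to the label text from the question
labels = np.where(p[idx[:-1]] == 1, 'High', 'Not-High')

# Convert NumPy scalars to plain Python floats and strings
intervals = [[float(start), float(stop), str(label)]
             for start, stop, label in zip(t[:-1], t[1:], labels)]
print(intervals)
# [[0.0, 0.75, 'High'], [0.75, 1.25, 'Not-High'], [1.25, 1.5, 'High']]
```

Keeping the times and labels in separate arrays until the final step avoids mixing floats and strings inside a single NumPy array.
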
HYRY

If I understood you correctly, I think something like this should work.

compact_prediction = list()
sequence = list()  # Holds the current sequence as [start, end, label]

last_prediction = 0

for index, prediction in enumerate(Predictions):
    if index == 0:
        sequence.append(0.0)  # The first sequence always starts at zero

    # For every later prediction, we only end the sequence when the
    # current prediction differs from the last one, signaling a change
    elif prediction != last_prediction:
        # The sequence ends where the current frame starts
        sequence.append(index * 0.25)

        # And we put the label based on the last prediction
        if last_prediction == 1:
            sequence.append('High')
        else:
            sequence.append('Not-High')

        # Append to our compact list and reset the sequence
        compact_prediction.append(sequence)
        sequence = list()

        # After resetting the sequence we append the start of the new one
        sequence.append(index * 0.25)

    # Save the last prediction so we can check if it changed
    last_prediction = prediction

# Close the final sequence, which has no following change to end it
if sequence:
    sequence.append(len(Predictions) * 0.25)
    sequence.append('High' if last_prediction == 1 else 'Not-High')
    compact_prediction.append(sequence)

print(compact_prediction)

Result: [[0.0, 2.5, 'High'], [2.5, 5.75, 'Not-High'], [5.75, 6.5, 'High'], [6.5, 6.75, 'Not-High'], [6.75, 7.0, 'High'], [7.0, 7.25, 'Not-High'], [7.25, 7.5, 'High'], [7.5, 7.75, 'Not-High'], [7.75, 8.0, 'High'], [8.0, 8.25, 'Not-High'], [8.25, 9.5, 'High'], [9.5, 10.25, 'Not-High'], [10.25, 11.75, 'High'], [11.75, 12.0, 'Not-High']]

forayer