How to filter out elements of an array that are not in a specific sequence in Python

Question

Sorry if this is a duplicate, but I could not find any solution to my problem.

I am looping through many dates, and I want to get rid of the dates that are not part of a sequence:

days = []
new_rows = []

for row in df.iterrows():

    date = row[1][0]
    date_init_input = date.replace("-", " ")
    date_num = datetime.datetime.strptime(date_init_input, '%Y %m %d').weekday()

    counter = 0

    if len(days) == 5:
        for day in days:
            if day == counter:

                print("Correct sequence " + new_rows[counter][1][0] + " " + findDay(new_rows[counter][1][0]))
                counter += 1

                if day == 4:
                    days.clear()
                    new_rows.clear()
            else:
                print("No sequence " + new_rows[counter][1][0] + " " + findDay(new_rows[counter][1][0]))

                modDf = df.drop(new_rows[counter][0])
                days.clear()
                new_rows.clear()

    else:
        print("No sequence " + date + " " + findDay(date) + " BBBBBBBBBBB")
        days.append(date_num)
        new_rows.append(row)

The issue here is that the loop moves forward five indexes at a time, which means that any sequence spanning two checks gets lost.

Simplified question

Let's say I have an array like this:

[0, 1, 2, 3, 4, 0, 1, 2, 4, 0, 0, 1, 2, 3, 4]

I want to remove the numbers that are not part of a specific sequence of length 5. I want my array to look like this:

[0, 1, 2, 3, 4, 0, 1, 2, 3, 4]

If you want further explanation please ask:)

What about `0, 1, 2`? The numbers are in sequence too. Or are you looking only for sequences of length `5`? — Andrej Kesely, Jan 25 '20 at 00:43
Do we have to increment the numbers by 1 every time? E.g. would 0,2,4,6,8 be valid? — twerk_it_606, Jan 25 '20 at 00:46
Does this answer your question? [String subpattern recognition optimization](https://stackoverflow.com/questions/58507418/string-subpattern-recognition-optimization) — norok2, Jan 25 '20 at 00:53
@norok2 That question only returns true if there is a pattern. I know that there is a pattern here, but I want to filter out everything that does not match it. — Elias Knudsen, Jan 25 '20 at 00:58

Leon · Answer 1 · 2020-01-25T01:23:12.937

0

I suggest a different approach. You can use a temporary list to keep track of a partial pattern match. While looping through the dates, append days to the temporary list as long as they comply with the sequence. As soon as they diverge from the sequence, empty the temporary list. If they complete the sequence, append the temporary list to your resulting list (new_rows in your case) and start over.

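So, in Python:

result = []
pattern = [1, 2, 3, 4, 5]
temp_list = []
alldays = [1, 2, 3, 4, 5, 1, 2, 2, 1, 2, 3, 4, 5]  # sample input

for day in alldays:
    if day == pattern[len(temp_list)]:
        # day continues the current partial match
        temp_list.append(day)
    elif day == pattern[0]:
        # sequence broken, but this day may start a new one
        temp_list = [day]
    else:
        # sequence broken: discard the partial match
        temp_list = []
    if len(temp_list) == len(pattern):
        # complete match: keep it and start over
        result += temp_list
        temp_list = []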

edited Jan 25 '20 at 01:23

answered Jan 25 '20 at 01:12


What is len.append(day)? – Elias Knudsen Jan 25 '20 at 01:18
An error, I fixed it. That line and the line above are responsible for adding the day to the temporary list as long as it fits the pattern. – Leon Jan 25 '20 at 01:24

score 0 · Answer 2 · answered Jan 25 '20 at 01:30

A straightforward method, based on your simplified version, that relies on strings: join everything into a single string and search for exact matches of what you need:

import re

seq = [0, 1, 2, 3, 4, 0, 1, 2, 4, 0, 0, 1, 2, 3, 4]
seq_s = ','.join([str(i) for i in seq])

search = '0,1,2,3,4'

Use re.finditer to return the non-overlapping matches and split each result. For added efficiency, found_seq can be kept as a generator until we decide what needs to be done with it:

found_seq = (m[0].split(',') for m in re.finditer(search, seq_s))

for i in found_seq:
    print(i)

Output:
['0', '1', '2', '3', '4']
['0', '1', '2', '3', '4']

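To collect the matches into a list of integers (re-create the generator first, since the loop above has already consumed it):

found_seq = (m[0].split(',') for m in re.finditer(search, seq_s))
found_list = []

for i in found_seq:
    # convert each matched item back to int
    found_list.extend(int(n) for n in i)

found_list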

Output:
[0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
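One caveat with the string approach: a plain substring search can match across item boundaries once the values have more than one digit (for example, `10,1,2,3,4` contains `0,1,2,3,4`). A minimal sketch of a boundary-safe variant follows. The helper name `filter_by_pattern` is made up for illustration, and it assumes that the string form of an item never contains a comma:

```python
import re


def filter_by_pattern(seq, pattern):
    """Keep only the non-overlapping runs of `seq` that equal `pattern`."""
    seq_s = ','.join(str(i) for i in seq)
    pattern_s = ','.join(str(i) for i in pattern)
    # re.escape guards against regex metacharacters in the items.
    # The lookarounds require a comma (or the string boundary) on both sides,
    # so a match cannot start or end in the middle of an item.
    search = r'(?<![^,])' + re.escape(pattern_s) + r'(?![^,])'
    n_matches = sum(1 for _ in re.finditer(search, seq_s))
    return list(pattern) * n_matches


seq = [0, 1, 2, 3, 4, 0, 1, 2, 4, 0, 0, 1, 2, 3, 4]
print(filter_by_pattern(seq, [0, 1, 2, 3, 4]))
# [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
print(filter_by_pattern([10, 1, 2, 3, 4], [0, 1, 2, 3, 4]))
# []
```

Since every match is identical to the pattern, counting the matches and repeating the pattern avoids converting the strings back to integers.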

norok2 · Answer 3 · 2020-01-29T10:17:40.740

(EDITED): Here are a couple of ways of achieving this:

def remove_non_standard_buffer(items, template):
    buffer = []
    len_template = len(template)
    j = 0
    for item in items:
        if j < len_template and item == template[j]:
            buffer.append(item)
            j += 1
        elif item == template[0]:
            buffer = [item]
            j = 1
        else:
            buffer = []
            j = 0
        if len(buffer) == len_template:
            for buffer_item in buffer:
                yield buffer_item
            buffer = []
            j = 0


def remove_non_standard_slicing(items, template):
    start = 0
    end = len(template)
    for item in items:
        test_seq = items[start:end]
        if test_seq == template:
            yield from template
        end += 1
        start += 1


def remove_non_standard_for(items, template):
    len_template = len(template)
    for i, item in enumerate(items):
        if items[i:i + len_template] == template:
            yield from template


def remove_non_standard_while(items, template):
    len_template = len(template)
    len_items = len(items)
    i = 0
    while i < len_items - len_template + 1:
        if items[i:i + len_template] == template:
            yield from template
            i += len_template
        else:
            i += 1


def remove_non_standard_while_reverse(items, template):
    i = 0
    len_template = len(template)
    len_items = len(items)
    while i < len_items - len_template + 1:
        to_yield = True
        for j in range(len_template - 1, -1, -1):
            if items[i + j] != template[j]:
                to_yield = False
                break
        if to_yield:
            yield from template
            i += len_template
        else:
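            # Caution: this skip can jump over a match that starts inside the
            # skipped span (e.g. items=[9, 0, 1, 2, 3, 4]); i += 1 is always safe.
            i += j + 1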

def remove_non_standard_count(items, template):
    n = 0
    i = 0
    len_template = len(template)
    len_items = len(items)
    while i < len_items - len_template + 1:
        if items[i:i + len_template] == template:
            n += 1
            i += len_template
        else:
            i += 1
    return template * n


def remove_non_standard_count_reverse(items, template):
    n = 0
    i = 0
    len_template = len(template)
    len_items = len(items)
    while i < len_items - len_template + 1:
        to_yield = True
        for j in range(len_template - 1, -1, -1):
            if items[i + j] != template[j]:
                to_yield = False
                break
        if to_yield:
            n += 1
            i += len_template
        else:
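            # Caution: same skip as in remove_non_standard_while_reverse();
            # it can jump over a match, whereas i += 1 is always safe.
            i += j + 1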
    return template * n

and testing it:

ll = [0, 1, 2, 3, 4, 0, 1, 2, 4, 0, 0, 1, 2, 3, 4]
print(list(remove_non_standard_buffer(ll, [0, 1, 2, 3, 4])))
# [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
print(list(remove_non_standard_slicing(ll, [0, 1, 2, 3, 4])))
# [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
print(list(remove_non_standard_for(ll, [0, 1, 2, 3, 4])))
# [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
print(list(remove_non_standard_while(ll, [0, 1, 2, 3, 4])))
# [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
print(list(remove_non_standard_while_reverse(ll, [0, 1, 2, 3, 4])))
# [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]

with the respective timings:

%timeit list(remove_non_standard_buffer(ll * 1000, [0, 1, 2, 3, 4]))
# 100 loops, best of 3: 3.35 ms per loop
%timeit list(remove_non_standard_slicing(ll * 1000, [0, 1, 2, 3, 4]))
# 100 loops, best of 3: 3.35 ms per loop
%timeit list(remove_non_standard_for(ll * 1000, [0, 1, 2, 3, 4]))
# 100 loops, best of 3: 3.19 ms per loop
%timeit list(remove_non_standard_while(ll * 1000, [0, 1, 2, 3, 4]))
# 100 loops, best of 3: 2.29 ms per loop
%timeit list(remove_non_standard_while_reverse(ll * 1000, [0, 1, 2, 3, 4]))
# 100 loops, best of 3: 2.52 ms per loop
%timeit remove_non_standard_count(ll * 1000, [0, 1, 2, 3, 4])
# 100 loops, best of 3: 1.85 ms per loop
%timeit remove_non_standard_count_reverse(ll * 1000, [0, 1, 2, 3, 4])
# 100 loops, best of 3: 2.13 ms per loop

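remove_non_standard_slicing() uses substantially the same approach as @EliasKnudsen's answer, but remove_non_standard_while() is considerably faster (2.29 ms vs. 3.35 ms in the timings above) because it jumps past a full match instead of re-testing every position. remove_non_standard_while_reverse() does fewer comparisons in principle, since it skips ahead on a mismatch, but it pays for the relatively slow explicit looping in Python and ends up slightly slower here (2.52 ms).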
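The `i += j + 1` skip in the `_reverse` variants is only safe when the mismatching item cannot belong to a later match. With `items = [9, 0, 1, 2, 3, 4]`, for instance, the match starting at index 1 is skipped. One possible fix, sketched below, is the bad-character shift known from Boyer-Moore string search. This is a swapped-in technique rather than the code timed above; the name `remove_non_standard_skip` is made up here, the items are assumed to be hashable, and it has not been benchmarked against the other versions:

```python
def remove_non_standard_skip(items, template):
    """Yield every non-overlapping occurrence of `template` found in `items`."""
    len_template = len(template)
    # Rightmost index of each value in the template (values must be hashable).
    last_index = {value: k for k, value in enumerate(template)}
    i = 0
    while i <= len(items) - len_template:
        # Compare right to left; j stops at the first mismatching offset,
        # or reaches -1 when the whole window matches.
        j = len_template - 1
        while j >= 0 and items[i + j] == template[j]:
            j -= 1
        if j < 0:
            yield from template
            i += len_template
        else:
            # Shift so that the mismatching item lines up with its rightmost
            # occurrence in the template, moving at least one position.
            i += max(1, j - last_index.get(items[i + j], -1))


print(list(remove_non_standard_skip([9, 0, 1, 2, 3, 4], [0, 1, 2, 3, 4])))
# [0, 1, 2, 3, 4]
```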

The _count solutions, in turn, are variants of the while versions that are somewhat over-optimized for lists: they take advantage of fast list multiplication (template * n), and are therefore probably less useful for pandas DataFrames.
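For a DataFrame it is usually more convenient to collect row positions than values, so that the matching rows can be selected afterwards. A minimal sketch along the lines of remove_non_standard_while() follows; the function name `matching_positions` and the suggested use with `df.iloc` are illustrative assumptions, not part of the code above:

```python
def matching_positions(items, template):
    """Return the positions of all items in non-overlapping runs equal to `template`."""
    items = list(items)
    template = list(template)
    len_template = len(template)
    positions = []
    i = 0
    while i <= len(items) - len_template:
        if items[i:i + len_template] == template:
            # Full match: remember every position in the window, then jump past it.
            positions.extend(range(i, i + len_template))
            i += len_template
        else:
            i += 1
    return positions


weekdays = [0, 1, 2, 3, 4, 0, 1, 2, 4, 0, 0, 1, 2, 3, 4]
keep = matching_positions(weekdays, [0, 1, 2, 3, 4])
print(keep)
# [0, 1, 2, 3, 4, 10, 11, 12, 13, 14]
```

With the weekday number computed for each row, as in the question, something like `df.iloc[keep]` would then keep only the rows that belong to a complete sequence.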

@EliasKnudsen perhaps you may want to look into some of these approaches, which may be faster for your problem. — norok2, Jan 29 '20 at 10:19

score 0 · Accepted Answer · answered Jan 25 '20 at 15:52

This is what I ended up using. I feel this is an inefficient approach, but it works:

array = [1, 2, 3, 4, 5, 1, 2, 2, 1, 2, 3, 4, 5]
pattern = [1, 2, 3, 4, 5]

end = 5
start = 0

for i in array:
    new_array = array[start:end]
    if new_array == pattern:
        print(str(new_array) + " Correct")
    else:
        print(str(new_array) + " False")

    end += 1
    start += 1
