Sorry if this is a duplicate, but I could not find any solution to my problem.

I am looping through many dates, and I want to drop the dates that are not part of a sequence:

days = []
new_rows = []

for row in df.iterrows():

    date = row[1][0]
    date_init_input = date.replace("-", " ")
    date_num = datetime.datetime.strptime(date_init_input, '%Y %m %d').weekday()

    counter = 0

    if len(days) == 5:
        for day in days:
            if day == counter:

                print("Correct sequence " + new_rows[counter][1][0] + " " + findDay(new_rows[counter][1][0]))
                counter += 1

                if day == 4:
                    days.clear()
                    new_rows.clear()
            else:
                print("No sequence " + new_rows[counter][1][0] + " " + findDay(new_rows[counter][1][0]))

                modDf = df.drop(new_rows[counter][0])
                days.clear()
                new_rows.clear()

    else:
        print("No sequence " + date + " " + findDay(date) + " BBBBBBBBBBB")
        days.append(date_num)
        new_rows.append(row)

The issue here is that the loop only moves forward five indexes at a time, which means that any sequence straddling two checks gets lost.

Simplified question

Let's say I have an array like this:

[0, 1, 2, 3, 4, 0, 1, 2, 4, 0, 0, 1, 2, 3, 4]

I want to remove the numbers that are not part of a specific sequence of length 5, so that my array looks like this:

[0, 1, 2, 3, 4, 0, 1, 2, 3, 4]

If you want further explanation, please ask :)

Elias Knudsen

4 Answers

I suggest a different approach. You can use a temporary list to keep track of a partial pattern match. While looping through the dates, append each day to the temporary list as long as it matches the next element of the sequence. As soon as a day diverges from the sequence, empty the temporary list. When the temporary list completes the sequence, add it to your resulting list (new_rows in your case) and start over.

So, in pseudocode:

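result = []
pattern = [1, 2, 3, 4, 5]
temp_list = []

for day in alldays:  # alldays: your list of day numbers
    if day == pattern[len(temp_list)]:
        temp_list.append(day)
    elif day == pattern[0]:
        # mismatch, but this day can start a new sequence
        temp_list = [day]
    else:
        # mismatch: discard the partial match
        temp_list = []
    if len(temp_list) == len(pattern):
        result = result + temp_list
        temp_list = []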
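
To make the idea easier to try out, here is a minimal runnable sketch of the same approach wrapped in a function and applied to the array from the simplified question. The function name keep_matching_runs is only illustrative, and the sketch assumes a non-empty pattern whose values are all different (as with weekdays 0 to 4); a pattern with repeated values would need a more careful restart rule.

```python
def keep_matching_runs(all_days, pattern):
    """Return only the items that belong to complete occurrences of pattern.

    Assumes pattern is non-empty and contains no repeated values.
    """
    result = []
    temp_list = []  # the partial match collected so far
    for day in all_days:
        if day == pattern[len(temp_list)]:
            # the day continues the current partial match
            temp_list.append(day)
        elif day == pattern[0]:
            # mismatch, but the day can start a fresh match
            temp_list = [day]
        else:
            # mismatch: throw the partial match away
            temp_list = []
        if len(temp_list) == len(pattern):
            # a full sequence was found: keep it and start over
            result.extend(temp_list)
            temp_list = []
    return result


days = [0, 1, 2, 3, 4, 0, 1, 2, 4, 0, 0, 1, 2, 3, 4]
print(keep_matching_runs(days, [0, 1, 2, 3, 4]))
# [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
```

The list is traversed only once, so no sequence is lost between two checks.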
Leon

A straightforward method, based on your simplified version, that relies on strings: join everything into one string and search for exact matches of the sequence you need:

import re

seq = [0, 1, 2, 3, 4, 0, 1, 2, 4, 0, 0, 1, 2, 3, 4]
seq_s = ','.join([str(i) for i in seq])

search = '0,1,2,3,4'

Use re.finditer to return the non-overlapping matches and split each result. For added efficiency, found_seq can stay a generator until we decide what needs to be done with it:

found_seq = (m[0].split(',') for m in re.finditer(search, seq_s))

for i in found_seq:
    print(i)

Output:
['0', '1', '2', '3', '4']
['0', '1', '2', '3', '4']

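To collect the matches into a flat list of integers (a generator can only be consumed once, so recreate it after the loop above):

found_seq = (m[0].split(',') for m in re.finditer(search, seq_s))
found_list = []

for i in found_seq:
    # convert the matched strings back to integers
    found_list.extend(int(n) for n in i)

found_list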

Output:
[0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
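
One caveat with the plain string search: '0,1,2,3,4' also matches inside '10,1,2,3,4' or '0,1,2,3,40', because the search does not know where a number starts or ends. For weekday numbers 0 to 6 this cannot happen, but for general integers a guard may be needed. The following is a minimal sketch, assuming non-negative integers; the function name keep_pattern_matches is hypothetical, and the lookarounds simply require that the match is not glued to a neighbouring digit.

```python
import re


def keep_pattern_matches(seq, pattern):
    """Keep only complete, non-overlapping occurrences of pattern in seq.

    Assumes both contain non-negative integers.
    """
    seq_s = ",".join(str(i) for i in seq)
    search = ",".join(str(i) for i in pattern)
    # (?<!\d) and (?!\d): no digit directly before or after the match,
    # so "0,1,2,3,4" is not found inside "10,1,2,3,4" or "0,1,2,3,40"
    regex = re.compile(r"(?<!\d)" + re.escape(search) + r"(?!\d)")
    found = []
    for m in regex.finditer(seq_s):
        found.extend(int(n) for n in m.group(0).split(","))
    return found


print(keep_pattern_matches([10, 1, 2, 3, 4, 0, 1, 2, 3, 4], [0, 1, 2, 3, 4]))
# [0, 1, 2, 3, 4]
```

Without the lookarounds, the same input would produce two matches, the first one starting in the middle of the 10.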
BernardL

(EDITED): Here are a couple of ways of achieving this:

def remove_non_standard_buffer(items, template):
    buffer = []
    len_template = len(template)
    j = 0
    for item in items:
        if item == template[j] and j < len_template:
            buffer.append(item)
            j += 1
        elif item == template[0]:
            buffer = [item]
            j = 1
        else:
            buffer = []
            j = 0
        if len(buffer) == len_template:
            for buffer_item in buffer:
                yield buffer_item
            buffer = []
            j = 0


def remove_non_standard_slicing(items, template):
    start = 0
    end = len(template)
    for item in items:
        test_seq = items[start:end]
        if test_seq == template:
            yield from template
        end += 1
        start += 1


def remove_non_standard_for(items, template):
    len_template = len(template)
    for i, item in enumerate(items):
        if items[i:i + len_template] == template:
            yield from template


def remove_non_standard_while(items, template):
    len_template = len(template)
    len_items = len(items)
    i = 0
    while i < len_items - len_template + 1:
        if items[i:i + len_template] == template:
            yield from template
            i += len_template
        else:
            i += 1


def remove_non_standard_while_reverse(items, template):
    i = 0
    len_template = len(template)
    len_items = len(items)
    while i < len_items - len_template + 1:
        to_yield = True
        for j in range(len_template - 1, -1, -1):
            if items[i + j] != template[j]:
                to_yield = False
                break
        if to_yield:
            yield from template
            i += len_template
        else:
            i += j + 1

def remove_non_standard_count(items, template):
    n = 0
    i = 0
    len_template = len(template)
    len_items = len(items)
    while i < len_items - len_template + 1:
        if items[i:i + len_template] == template:
            n += 1
            i += len_template
        else:
            i += 1
    return template * n


def remove_non_standard_count_reverse(items, template):
    n = 0
    i = 0
    len_template = len(template)
    len_items = len(items)
    while i < len_items - len_template + 1:
        to_yield = True
        for j in range(len_template - 1, -1, -1):
            if items[i + j] != template[j]:
                to_yield = False
                break
        if to_yield:
            n += 1
            i += len_template
        else:
            i += j + 1
    return template * n

and testing it:

ll = [0, 1, 2, 3, 4, 0, 1, 2, 4, 0, 0, 1, 2, 3, 4]
print(list(remove_non_standard_buffer(ll, [0, 1, 2, 3, 4])))
# [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
print(list(remove_non_standard_slicing(ll, [0, 1, 2, 3, 4])))
# [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
print(list(remove_non_standard_for(ll, [0, 1, 2, 3, 4])))
# [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
print(list(remove_non_standard_while(ll, [0, 1, 2, 3, 4])))
# [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
print(list(remove_non_standard_while_reverse(ll, [0, 1, 2, 3, 4])))
# [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
# the _count variants return a list directly
print(remove_non_standard_count(ll, [0, 1, 2, 3, 4]))
# [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
print(remove_non_standard_count_reverse(ll, [0, 1, 2, 3, 4]))
# [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]

with the respective timings:

%timeit list(remove_non_standard_buffer(ll * 1000, [0, 1, 2, 3, 4]))
# 100 loops, best of 3: 3.35 ms per loop
%timeit list(remove_non_standard_slicing(ll * 1000, [0, 1, 2, 3, 4]))
# 100 loops, best of 3: 3.35 ms per loop
%timeit list(remove_non_standard_for(ll * 1000, [0, 1, 2, 3, 4]))
# 100 loops, best of 3: 3.19 ms per loop
%timeit list(remove_non_standard_while(ll * 1000, [0, 1, 2, 3, 4]))
# 100 loops, best of 3: 2.29 ms per loop
%timeit list(remove_non_standard_while_reverse(ll * 1000, [0, 1, 2, 3, 4]))
# 100 loops, best of 3: 2.52 ms per loop
%timeit remove_non_standard_count(ll * 1000, [0, 1, 2, 3, 4])
# 100 loops, best of 3: 1.85 ms per loop
%timeit remove_non_standard_count_reverse(ll * 1000, [0, 1, 2, 3, 4])
# 100 loops, best of 3: 2.13 ms per loop

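remove_non_standard_slicing() uses substantially the same approach as @EliasKnudsen's answer, but remove_non_standard_while() is considerably faster because it jumps past a whole match instead of re-checking every position. remove_non_standard_while_reverse() needs fewer comparisons in principle, but it pays for the relatively slow explicit looping in Python, which is why it is not faster in the timings above. Note that the _reverse variants skip j + 1 positions after a mismatch, which can jump over a valid match: with items [1, 0, 1, 2, 3, 4] and template [0, 1, 2, 3, 4] they return nothing.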

The _count solutions, instead, are versions of the while approach that are somewhat over-optimized for lists: they take advantage of the fast list multiplication, and are therefore probably less useful for pandas dataframes.
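
The j + 1 skip used in the _reverse variants is only safe when the mismatching item cannot belong to a later match; for example, items [1, 0, 1, 2, 3, 4] with template [0, 1, 2, 3, 4] make them jump over the only match. A skip that stays safe can be taken from the Boyer-Moore-Horspool string search. The sketch below is one possible way to do it, not a benchmarked replacement: the function name remove_non_standard_horspool is hypothetical, and it assumes the items are hashable so they can be used as dictionary keys.

```python
def remove_non_standard_horspool(items, template):
    """Keep non-overlapping occurrences of template, skipping ahead safely.

    Assumes items and template are lists of hashable values.
    """
    len_template = len(template)
    len_items = len(items)
    if len_template == 0:
        return []
    # shift[x]: how far the window may move when its last item is x;
    # later positions overwrite earlier ones, leaving the smallest safe shift
    shift = {value: len_template - 1 - k
             for k, value in enumerate(template[:-1])}
    result = []
    i = 0
    while i < len_items - len_template + 1:
        if items[i:i + len_template] == template:
            result.extend(template)
            i += len_template
        else:
            # items not in the template allow a jump of the full template length
            i += shift.get(items[i + len_template - 1], len_template)
    return result


print(remove_non_standard_horspool([1, 0, 1, 2, 3, 4], [0, 1, 2, 3, 4]))
# [0, 1, 2, 3, 4]
```

On the test list ll from above it returns the same result as the other functions.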

norok2
  • @EliasKnudsen perhaps you may want to look into some of these approaches, which may be faster for your problem. – norok2 Jan 29 '20 at 10:19

This is what I ended up using. I feel it is an inefficient approach, but it works:

array = [1, 2, 3, 4, 5, 1, 2, 2, 1, 2, 3, 4, 5]
pattern = [1, 2, 3, 4, 5]

# slide a window of len(pattern) over the array, one position at a time
for start in range(len(array) - len(pattern) + 1):
    new_array = array[start:start + len(pattern)]
    if new_array == pattern:
        print(str(new_array) + " Correct")
    else:
        print(str(new_array) + " False")
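
The loop above only reports which windows match. To get back to the original problem, the matching windows still have to be turned into the list of dates to keep. Here is a minimal sketch of that step using only the standard library; the function names weekday_of and keep_full_weeks and the sample dates are made up for illustration, and the dates are assumed to be strings in YYYY-MM-DD format as in the question. Like the code in the question, it only compares weekday numbers and does not check that the dates are consecutive calendar days.

```python
import datetime


def weekday_of(date_string):
    """Return 0 for Monday ... 6 for Sunday for a 'YYYY-MM-DD' string."""
    return datetime.datetime.strptime(date_string, "%Y-%m-%d").weekday()


def keep_full_weeks(dates, pattern=(0, 1, 2, 3, 4)):
    """Keep only the dates whose weekdays form a complete run of pattern."""
    weekdays = [weekday_of(d) for d in dates]
    size = len(pattern)
    kept = []
    start = 0
    while start <= len(dates) - size:
        if tuple(weekdays[start:start + size]) == tuple(pattern):
            # full Monday-to-Friday run: keep it and jump past it
            kept.extend(dates[start:start + size])
            start += size
        else:
            start += 1
    return kept


dates = [
    "2020-01-27", "2020-01-28", "2020-01-29", "2020-01-30", "2020-01-31",
    "2020-02-03", "2020-02-04", "2020-02-06",
    "2020-02-10", "2020-02-11", "2020-02-12", "2020-02-13", "2020-02-14",
]
print(keep_full_weeks(dates))
```

The incomplete week starting on 2020-02-03 is dropped, while the two complete weeks are kept. With a dataframe, the kept dates (or their positions) could then be used to select the rows.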
Elias Knudsen