1

I have a list of dates called dates:

From this list I want to remove the dates which belong to a range of three or more consecutive days. These are the dates I indented in the list.

What is the fastest way to do this?

[datetime.date(2018, 7, 2),
 datetime.date(2018, 7, 5),
 datetime.date(2018, 7, 7),
     datetime.date(2018, 7, 15),
     datetime.date(2018, 7, 16),
     datetime.date(2018, 7, 17),
 datetime.date(2018, 7, 29),
 datetime.date(2018, 8, 13),
 datetime.date(2018, 8, 27),
 datetime.date(2018, 9, 19),
 datetime.date(2018, 10, 25),
 datetime.date(2018, 11, 9),
     datetime.date(2018, 12, 21),
     datetime.date(2018, 12, 22),
     datetime.date(2018, 12, 23),
     datetime.date(2018, 12, 24),
     datetime.date(2018, 12, 25),
     datetime.date(2019, 1, 2),
     datetime.date(2019, 1, 3),
     datetime.date(2019, 1, 4),
     datetime.date(2019, 1, 5),
     datetime.date(2019, 1, 6),
     datetime.date(2019, 1, 7),
     datetime.date(2019, 1, 8),
     datetime.date(2019, 2, 27),
     datetime.date(2019, 2, 28),
     datetime.date(2019, 3, 1),
     datetime.date(2019, 3, 2),
     datetime.date(2019, 3, 3),
 datetime.date(2019, 3, 6),
     datetime.date(2019, 3, 11),
     datetime.date(2019, 3, 12),
     datetime.date(2019, 3, 13),
     datetime.date(2019, 3, 14),
 datetime.date(2019, 3, 16),
 datetime.date(2019, 3, 25),
 datetime.date(2019, 3, 27),
 datetime.date(2019, 3, 29),
 datetime.date(2019, 3, 30),
 datetime.date(2019, 4, 8)]

So the expected outcome after the dates, which belong to a range of three or more consecutive days, are deleted should be:

[datetime.date(2018, 7, 2),
 datetime.date(2018, 7, 5),
 datetime.date(2018, 7, 7),
 datetime.date(2018, 7, 29),
 datetime.date(2018, 8, 13),
 datetime.date(2018, 8, 27),
 datetime.date(2018, 9, 19),
 datetime.date(2018, 10, 25),
 datetime.date(2018, 11, 9),
 datetime.date(2019, 3, 6),
 datetime.date(2019, 3, 16),
 datetime.date(2019, 3, 25),
 datetime.date(2019, 3, 27),
 datetime.date(2019, 3, 29),
 datetime.date(2019, 3, 30),
 datetime.date(2019, 4, 8)]
Nelis
  • 91
  • 1
  • 11

6 Answers6

2

My solution is the following:

import datetime

dates = [datetime.date(2018, 7, 2),
         datetime.date(2018, 7, 5),
         ...,
         datetime.date(2019, 3, 30),
         datetime.date(2019, 4, 8)]


def are_consecutive(d1, d2):
    return d2-d1 == datetime.timedelta(1)

filtered_out = set()
consecutive = set()
for i,d in enumerate(sorted(dates)):
    try:
        d1,d2 = dates[i:i+2]
    except:
        break
    if are_consecutive(d1, d2):
        consecutive.add(d1)
        consecutive.add(d2)
    else:
        if len(consecutive) >= 3:
            for date in consecutive:
                filtered_out.add(date)
        consecutive = set()

selected = [d for d in dates if d not in filtered_out]

selected is:

[datetime.date(2018, 7, 2),
 datetime.date(2018, 7, 5),
 datetime.date(2018, 7, 7),
 datetime.date(2018, 7, 29),
 datetime.date(2018, 8, 13),
 datetime.date(2018, 8, 27),
 datetime.date(2018, 9, 19),
 datetime.date(2018, 10, 25),
 datetime.date(2018, 11, 9),
 datetime.date(2019, 3, 6),
 datetime.date(2019, 3, 16),
 datetime.date(2019, 3, 25),
 datetime.date(2019, 3, 27),
 datetime.date(2019, 3, 29),
 datetime.date(2019, 3, 30),
 datetime.date(2019, 4, 8)]

Which is correct if you consider 27 Feb, 28 Feb and 1 Mar 2019 to be consecutive, which they are!

To briefly explain the code: are_consecutive() simply checks if two dates are consecutive. If so their difference should return datetime.timedelta(1). I use this function to check each date with the next one. The dates are sorted at the beginning of the loop just to be sure of their order. If the dates are consecutive, they get stored in the consecutive set, if not, then I check how many consecutive dates have been stored so far. If it's 3 or more, then save the result in the filtered_out set, otherwise not. consecutive get reset every time two dates are not consecutive.

alec_djinn
  • 10,104
  • 8
  • 46
  • 71
1

My answer is as below:

import datetime
import numpy as np

dates = [datetime.date(2018, 7, 2),
         datetime.date(2018, 7, 5),
         ......
         datetime.date(2019, 4, 8)]

dates = np.array(dates)
inds = np.ones_like(dates, np.bool)

i = 0
while i < len(dates) - 1:
    datei = dates[i]
    for j in range(i + 1, len(dates)):
        datej = dates[j]
        if datei + datetime.timedelta(j - i) != datej:
            break
    if j - i >= 3:
        inds[range(i, j)] = False

    i = j

dates = dates[inds]
print(dates)

output:

[datetime.date(2018, 7, 2) datetime.date(2018, 7, 5)
 datetime.date(2018, 7, 7) datetime.date(2018, 7, 29)
 datetime.date(2018, 8, 13) datetime.date(2018, 8, 27)
 datetime.date(2018, 9, 19) datetime.date(2018, 10, 25)
 datetime.date(2018, 11, 9) datetime.date(2019, 3, 6)
 datetime.date(2019, 3, 16) datetime.date(2019, 3, 25)
 datetime.date(2019, 3, 27) datetime.date(2019, 3, 29)
 datetime.date(2019, 3, 30) datetime.date(2019, 4, 8)]
ToughMind
  • 987
  • 1
  • 10
  • 28
  • The output is not completely correct: as for example the `datetime.date(2018, 7, 15), datetime.date(2018, 7, 16), datetime.date(2018, 7, 17)` are still in the output whereas these dates are three consecutive dates and have to be deleted... – Nelis Aug 09 '19 at 09:18
  • I have correct this, and `datetime.date(2019, 2, 27)` and `datetime.date(2019, 2, 28)` should not in the output. – ToughMind Aug 09 '19 at 09:21
  • I think my answer may have advantage in speed. Because I use numpy indexs method to delete things and won't do any calculation twice. – ToughMind Aug 09 '19 at 09:25
  • @ToughMind The for loop nested in the while loop is not fast at all. Your solution is about 50% slower than mine that uses a single for loop. – alec_djinn Aug 09 '19 at 10:01
  • I have made a little benchmark, and from my tests, @alec_djinn is 3 times faster. – IMCoins Aug 09 '19 at 10:08
1

The print are unfortunately too long to print, so I guess I'll just leave the answer with the comments. Feel free to try the code and tell me if I forgot an edge case. ;)

The DataFrame must be sorted in ascending order.

    #   Creating the DataFrame. Deleting some dates in order to have some that
    #   are not consecutives and isolated.
    df = pd.DataFrame({
        'date' : pd.date_range(start='01/01/2018', end='31/01/2018')
    })
    df = df.loc[ ~df.index.isin([1, 3, 5, 10, 12, 15, 25]) ]

    #   First : Count the consecutive days.
    #   Take the difference of each days, and make a boolean mask
    #   of those who have a difference not equal to 1.
    #   We now have False where the difference is 1, and True where it is not.
    #   The cumulative sum gives us 'groups' of consecutive dates.
    df['range_count'] = df['date'].diff().dt.days.ne(1).cumsum()

    #   Use the previous groups and count the number of items in each group.
    #   I use transform to apply the group counts to each row.
    df['check'] = df.groupby('range_count')['date'].transform('count')

    #   Then, the select is easy.
    print(
        df.loc[df['check'] < 3, 'date']
    )
    # 0    2018-01-01
    # 2    2018-01-03
    # 4    2018-01-05
    # 11   2018-01-12
    # 13   2018-01-14
    # 14   2018-01-15
IMCoins
  • 3,149
  • 1
  • 10
  • 25
0

Assuming dates is the list you provide, in ascending order, the following code:

j = 0                        # index of the date checked for consecutives
while j < len(dates):
    date = dates[j]          # the date checked for consecutives
    i = 1                    # counter of consecutive days in the list
    j += 1 
    while True:              # count consecutive days and delete when 3 or more found
        date = date + datetime.timedelta(days=1) # check if the following day is in the list
        if date in dates:    # if found in the list then:
            i += 1               # count it and check for the next.
        else:                # if not in the list then:
            if i > 2:            # if 3 or more consecutive dates are found
                del dates[j-1:j+i-1]   # delete them from list
            break
print(dates)

has the desired output:

[datetime.date(2018, 7, 2), datetime.date(2018, 7, 5), datetime.date(2018, 7, 7), datetime.date(2018, 7, 29), datetime.date(2018, 8, 13), datetime.date(2018, 8, 27), datetime.date(2018, 9, 19), datetime.date(2018, 10, 25), datetime.date(2018, 11, 9), datetime.date(2019, 1, 2), datetime.date(2019, 2, 27), datetime.date(2019, 3, 6), datetime.date(2019, 3, 16), datetime.date(2019, 3, 25), datetime.date(2019, 3, 27), datetime.date(2019, 3, 29), datetime.date(2019, 3, 30), datetime.date(2019, 4, 8)]
Neo
  • 627
  • 3
  • 7
0

Inspired by this post. If you first find all consecutive days, group this consecutive periods and finally find periods with 3 or more consecutive periods.

s = pd.Series([
     datetime.date(2018, 7, 2),
     datetime.date(2018, 7, 5),
     datetime.date(2018, 7, 7),
     ...
    ])

# Define 1 day difference
day = pd.Timedelta('1d')

# Find all consecutive days  
consecutive_days = ((s - s.shift(-1)).abs() == day) | ((s.diff() == day))
consecutive_groups = (s.diff() != day).cumsum() # group into consecutive periods

# Find groups with 3 or more consecutive days
unique, count = np.unique(consecutive_groups , return_counts=True)
s[~consecutive_groups .isin(unique[count >= 3])].tolist()

This returns the following.

[datetime.date(2018, 7, 2),
 datetime.date(2018, 7, 5),
 datetime.date(2018, 7, 7),
 datetime.date(2018, 7, 29),
 datetime.date(2018, 8, 13),
 datetime.date(2018, 8, 27),
 datetime.date(2018, 9, 19),
 datetime.date(2018, 10, 25),
 datetime.date(2018, 11, 9),
 datetime.date(2019, 3, 6),
 datetime.date(2019, 3, 16),
 datetime.date(2019, 3, 25),
 datetime.date(2019, 3, 27),
 datetime.date(2019, 3, 29),
 datetime.date(2019, 3, 30),
 datetime.date(2019, 4, 8)]
Josmoor98
  • 1,721
  • 10
  • 27
0

My take on this:

unconsecutive_dates = []
previous = None
for d in sorted(dates):
    if unconsecutive_dates and d == unconsecutive_dates[-1] + datetime.timedelta(days=1):
        unconsecutive_dates.pop()
    elif previous != d - datetime.timedelta(days=1):
        unconsecutive_dates.append(d)
    previous = d
Hrabal
  • 2,403
  • 2
  • 20
  • 30