
The title says most of it: find the longest run of consecutive ones (or Trues) for each year, and if a run at the end of one year continues into the following year, merge the two parts and count the whole run for the earlier year. I have implemented this, but it feels like a bit of a 'hack', and I am wondering if there is a better way to do it.

Reproducible Example Code:

# Modules needed
import pandas as pd
import numpy as np

# Example Input array of Ones and Zeroes with a datetime-index (Original data is time-series)

InputArray = pd.Series([0,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1])
InputArray.index = (pd.date_range('2000-12-22', '2001-01-06'))
boolean_array = InputArray == 1 #convert to boolean

# Wanted Output
# Year    MaxConsecutive-Ones
# 2000    9
# 2001    3

Below is my initial code to achieve the wanted output:

# function to get max consecutive for a particular array. i.e. will be done for each year below (groupby)
def GetMaxConsecutive(boolean_array):
    distinct = boolean_array.ne(boolean_array.shift()).cumsum() # associate each trues/false to a number 
    distinct = distinct[boolean_array] # only consider trues from the distinct values
    consect = distinct.value_counts().max() # find the number of consecutives of distincts values then find the maximum value
    return consect

# Find the maximum consecutive 'Trues' for each year.
MaxConsecutive = boolean_array.groupby(lambda x: x.year).apply(GetMaxConsecutive)
print(MaxConsecutive)
# Year    MaxConsecutive-Ones
# 2000    7
# 2001    3

However, the output above is still not what we want, because groupby cuts the data at each year boundary.

The code below tries to 'fix' this by computing the number of consecutive ones on either side of each year boundary (i.e. the run ending on previous_year-12-31 and the run starting on current_year-01-01). If their sum is larger than the original MaxConsecutive-Ones from the output above, we replace it.

# First) we acquire all start_of_year and end_of_year data
start_of_year = boolean_array.loc[(boolean_array.index.month==1) & (boolean_array.index.day==1)]
end_of_year = boolean_array.loc[(boolean_array.index.month==12) & (boolean_array.index.day==31)]

# Second) we mask above start_of_year and end_of_year data: to only have elements that are "True"
start_of_year = start_of_year[start_of_year]
end_of_year = end_of_year[end_of_year]

# Third) Change index to only contain years (rather than datetime index)
# Also for "start_of_year" array include -1 to the years when setting the index. 
# So we can match end_of_year to start_of_year arrays!
start_of_year = pd.Series(start_of_year)
start_of_year.index = start_of_year.index.year - 1
end_of_year = pd.Series(end_of_year)
end_of_year.index = end_of_year.index.year

# Combine index-years that are 'matched'
matched_years = pd.concat([end_of_year, start_of_year], axis = 1)
matched_years = matched_years.dropna()
matched_years = matched_years.index

# Finally) Compute the consecutive 1s/trues at the boundaries 
# for each matched years
# Copy once, before the loop, so that an update for one year is not lost when the next year is processed
Modify_MaxConsecutive = MaxConsecutive.copy()

for year in matched_years:
    # Compute the number of consecutive 1s/trues at the start of the following year
    start = boolean_array.loc[boolean_array.index.year == (year + 1)]
    distinct = start.ne(start.shift()).cumsum() # associate each run of trues/falses with a number
    distinct_masked = distinct[start] # keep only the run numbers that belong to trues
    count_distincts = distinct_masked.value_counts() # index: run number, value: length of that run
    start_consecutive = count_distincts.loc[distinct_masked.min()] # length of the first run of trues, i.e. the one starting on 01-01

    # Compute the number of consecutive 1s/trues at the end of the current year
    end = boolean_array.loc[boolean_array.index.year == year]
    distinct = end.ne(end.shift()).cumsum() # associate each run of trues/falses with a number
    distinct_masked = distinct[end] # keep only the run numbers that belong to trues
    count_distincts = distinct_masked.value_counts() # index: run number, value: length of that run
    end_consecutive = count_distincts.loc[distinct_masked.max()] # length of the last run of trues, i.e. the one ending on 12-31

    # Merge/add the consecutives at the boundaries (start-of-year and previous-end-of-year)
    ConsecutiveAtBoundaries = start_consecutive + end_consecutive

    # Modify the original MaxConsecutive if ConsecutiveAtBoundaries is larger
    if Modify_MaxConsecutive.loc[year] < ConsecutiveAtBoundaries:
        Modify_MaxConsecutive.loc[year] = ConsecutiveAtBoundaries

# Wanted Output is achieved!
print(Modify_MaxConsecutive)
# Year    MaxConsecutive-Ones
# 2000    9
# 2001    3
WDS
  • Welcome to stack overflow! Please see this post on [how to create a good reproducible example in pandas](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples/20159305#20159305). Specifically, it is not clear what you are trying to do, and most of this code seems unrelated to your specific programming issue. Please clarify your question (what is a consecutive?) and reduce your code to include only the minimum amount necessary. – Michael Delgado Jun 08 '20 at 09:47
  • @MichaelDelgado sorry, I have edited the post to hopefully clear it up. All the code is necessary to achieve the wanted output, but it just seems a bit of a 'hack', as stated in the post, so I was asking if there is a better way to do it. – WDS Jun 08 '20 at 10:39
  • I don't have the time to type together a complete solution but I can give you an Idea. 1. negate your boolean array. 2. use `np.where` to find indices of `True` values. 3. subtract those indices from shifted version of themselves to get the distances between consecutive 0s (i.e. streaks of 1s). 4. subtract 1. 5. group by year at starting index. 6. take the max of those groups. – swenzel Jun 08 '20 at 11:18
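The idea in the last comment can be sketched roughly as follows. This is only an illustration of those steps, not swenzel's code: padding the array with a False at each end (so that streaks touching either edge are bounded by a "zero") is an assumption added here, and the names `values`, `padded`, `zero_pos` and `streaks` are made up for the example.

```python
import numpy as np
import pandas as pd

values = pd.Series(
    [0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1],
    index=pd.date_range("2000-12-22", "2001-01-06"),
).astype(bool)

# Assumption: pad with one False on each side so edge streaks are closed off
padded = np.concatenate(([False], values.to_numpy(), [False]))

# Steps 1-2: negate and locate the zeros (positions in the padded array)
zero_pos = np.where(~padded)[0]

# Steps 3-4: distance between neighbouring zeros, minus 1, is the number of ones between them
streak_len = np.diff(zero_pos) - 1

# A streak that follows the zero at padded position p starts at position p of the original array
streak_start = zero_pos[:-1]

# Drop the empty gaps between adjacent zeros
has_streak = streak_len > 0
streaks = pd.Series(
    streak_len[has_streak],
    index=values.index[streak_start[has_streak]],
)

# Steps 5-6: group by the year of each streak's first day and take the maximum
longest_per_year = streaks.groupby(streaks.index.year).max()
print(longest_per_year)
```

Because each streak is measured on the full array before grouping, a streak that crosses the year boundary is counted once, for the year in which it starts.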

3 Answers


Not sure if this is the most efficient, but it's one solution:

arr = pd.Series([0,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1])
arr.index = (pd.date_range('2000-12-22', '2001-01-06'))
arr = arr.astype(bool)
df = arr.reset_index()  # convert to df
df['adj_year'] = df['index'].dt.year  # adj_year will be adjusted for streaks

mask = (df[0].eq(True)) & (df[0].shift().eq(True))
# blank out streak continuations (NaN), fill them from the row above, then restore the integer dtype
df['adj_year'] = df['adj_year'].where(~mask).ffill().astype('int')
df.groupby('adj_year').apply(lambda x: x[0].groupby(x[0].ne(x[0].shift()).cumsum()).sum().max())
# label each run of equal values, count the Trues in each run, and take the longest run per adjusted year

Output:

adj_year
2000    9
2001    3
dtype: int64

Note:

  • By convention, variable names in Python (except for classes) are lower case, so arr rather than InputArray.
  • 1 and 0 are equivalent to True and False, so you can convert them to boolean without the explicit comparison.
  • ne/shift/cumsum gives every run of equal values its own label; summing a run of booleans counts its Trues, so the maximum of those sums is the longest streak, even when a year contains several separate streaks.
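As a small illustration of the note about converting 1 and 0 to booleans, the sketch below compares the two conversions. The series `s` and `t` are made-up examples; the second one shows a case where the two approaches differ, which only matters if the data can contain values other than 0 and 1.

```python
import pandas as pd

s = pd.Series([0, 1, 0, 1, 1])

by_comparison = s == 1       # explicit comparison, as in the question
by_cast = s.astype(bool)     # direct cast, as in this answer

# For strictly 0/1 data both give the same boolean Series
same_for_binary = by_comparison.equals(by_cast)

# Hypothetical non-binary data: any nonzero value casts to True,
# while the comparison only accepts exactly 1
t = pd.Series([0, 2, 1])
compared = (t == 1).tolist()
cast = t.astype(bool).tolist()
print(same_for_binary, compared, cast)
```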
Josh Friedlander

Now I've got the time. Here is my solution:

# Modules needed
import pandas as pd
import numpy as np

input_array = pd.Series([0,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1], dtype=bool)
input_dates = pd.date_range('2000-12-22', '2001-01-06')
df = pd.DataFrame({"input": input_array, "dates": input_dates})

streak_starts = df.index[~df.input.shift(1, fill_value=False) & df.input]
streak_ends = df.index[~df.input.shift(-1, fill_value=False) & df.input] + 1
streak_lengths = streak_ends - streak_starts

streak_df = df.iloc[streak_starts].copy()
streak_df["streak_length"] = streak_lengths

longest_streak_per_year = streak_df.groupby(streak_df.dates.dt.year).streak_length.max()

output:

dates
2000    9
2001    3
Name: streak_length, dtype: int64
swenzel

This solution doesn't answer the question exactly, so it will not be the final answer: it reports the longest streak that touches each year, so a streak crossing a year boundary is counted for both the year it starts in and the following year.

boolean_array = pd.Series([0,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1]).astype(bool)
boolean_array.index = (pd.date_range('2000-12-22', '2001-01-06'))

distinct = boolean_array.ne(boolean_array.shift()).cumsum() 
distinct_masked = distinct[boolean_array] 
streak_sum = distinct_masked.value_counts() 
streak_sum_series =  pd.Series(streak_sum.loc[distinct_masked].values, index = distinct_masked.index.copy())
max_consect = streak_sum_series.groupby(lambda x: x.year).max()

Output:

max_consect 
2000    9
2001    9
dtype: int64
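One possible way to adapt the same run labelling so that each streak is counted only for the year in which it starts is sketched below. This is an assumption about how the approach could be extended, not part of the answer above; the names `run_id`, `true_runs`, `run_length`, `run_start` and `max_by_start_year` are introduced only for this example.

```python
import pandas as pd

boolean_array = pd.Series(
    [0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1],
    index=pd.date_range("2000-12-22", "2001-01-06"),
).astype(bool)

# Label every run of equal values, then keep only the labels of the True runs
run_id = boolean_array.ne(boolean_array.shift()).cumsum()
true_runs = run_id[boolean_array]

# Length of each True run, indexed by run label
run_length = true_runs.groupby(true_runs).size()

# First date of each True run, indexed by run label
dates = true_runs.index.to_series()
run_start = dates.groupby(true_runs).min()

# Attribute each run to the year of its first date, then take the longest run per year
max_by_start_year = run_length.groupby(run_start.dt.year).max()
print(max_by_start_year)
```

Grouping the run lengths by the start year, rather than grouping the daily values by calendar year, is what keeps the boundary-crossing streak from also being credited to 2001.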
WDS