The title says most of it: I want to find the maximum number of consecutive ones/1s (or Trues) for each year, and if a run of consecutive 1s at the end of a year continues into the following year, merge the two together. I have tried to implement this, but it seems like a bit of a 'hack', and I am wondering if there is a better way to do it.
Reproducible Example Code:
# Modules needed
import pandas as pd
import numpy as np
# Example Input array of Ones and Zeroes with a datetime-index (Original data is time-series)
InputArray = pd.Series([0,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1])
InputArray.index = (pd.date_range('2000-12-22', '2001-01-06'))
boolean_array = InputArray == 1 #convert to boolean
# Wanted Output
# Year MaxConsecutive-Ones
# 2000 9
# 2001 3
Below is my initial code to achieve the wanted output:
# function to get max consecutive for a particular array. i.e. will be done for each year below (groupby)
def GetMaxConsecutive(boolean_array):
    distinct = boolean_array.ne(boolean_array.shift()).cumsum() # associate each run of trues/falses with a number
    distinct = distinct[boolean_array] # only consider trues from the distinct values
    consect = distinct.value_counts().max() # count the elements of each distinct run, then take the maximum
    return consect
# Find the maximum consecutive 'Trues' for each year.
MaxConsecutive = boolean_array.groupby(lambda x: x.year).apply(GetMaxConsecutive)
print(MaxConsecutive)
# Year MaxConsecutive-Ones
# 2000 7
# 2001 3
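To make the labelling step inside GetMaxConsecutive easier to follow, here is a minimal sketch on a tiny series. The series `demo` and the intermediate names are hypothetical and only there for illustration; the calls are the same `ne`/`shift`/`cumsum`/`value_counts` used above.

```python
import pandas as pd

# Small hypothetical boolean series, used only to illustrate the labelling trick.
demo = pd.Series([False, True, True, False, True, True, True])

# True wherever a value differs from the previous one
# (the first element is compared with NaN, so it always counts as a change).
changes = demo.ne(demo.shift())

# The cumulative sum of the change flags gives every run its own integer label.
run_id = changes.cumsum()
print(run_id.tolist())  # [1, 2, 2, 3, 4, 4, 4]

# Keep the labels of the True runs only, then count how often each label occurs.
run_lengths = run_id[demo].value_counts()
print(run_lengths.max())  # 3
```

One thing to watch for: if a group contains no True at all, the masked series is empty and `value_counts().max()` should come back as NaN rather than 0, so that case may need handling separately.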
However, the output above is still not what we want, because groupby splits the data at each year boundary: the run of 9 is cut into 7 (end of 2000) and 2 (start of 2001).
The code below tries to 'fix' this by computing the consecutive ones on both sides of each year boundary (i.e. previous_year-12-31 and current_year-01-01). If the merged count at the boundary is larger than the original MaxConsecutive-Ones from the output above, we replace it.
# First) we acquire all start_of_year and end_of_year data
start_of_year = boolean_array.loc[(boolean_array.index.month==1) & (boolean_array.index.day==1)]
end_of_year = boolean_array.loc[(boolean_array.index.month==12) & (boolean_array.index.day==31)]
# Second) we mask above start_of_year and end_of_year data: to only have elements that are "True"
start_of_year = start_of_year[start_of_year]
end_of_year = end_of_year[end_of_year]
# Third) Change index to only contain years (rather than datetime index)
# Also for the "start_of_year" array, subtract 1 from the years when setting the index.
# So we can match end_of_year to start_of_year arrays!
start_of_year = pd.Series(start_of_year)
start_of_year.index = start_of_year.index.year - 1
end_of_year = pd.Series(end_of_year)
end_of_year.index = end_of_year.index.year
# Combine index-years that are 'matched'
matched_years = pd.concat([end_of_year, start_of_year], axis = 1)
matched_years = matched_years.dropna()
matched_years = matched_years.index
# Finally) Compute the consecutive 1s/trues at the boundaries
# for each matched years
# Work on a copy, created once before the loop so that updates for earlier years are kept
Modify_MaxConsecutive = MaxConsecutive.copy()
for year in matched_years:
    # Compute the number of consecutive 1s/trues at the start-of-year
    start = boolean_array.loc[boolean_array.index.year == (year + 1)]
    distinct = start.ne(start.shift()).cumsum() # associate each run of consecutive trues/falses with a number
    distinct_masked = distinct[start] # only consider trues from the distinct values, i.e. remove elements of "distinct" that are "False" in the boolean array
    count_distincts = distinct_masked.value_counts() # the index is the distinct run number, the value is the length of that run
    start_consecutive = count_distincts.loc[distinct_masked.min()] # length of the run at the start of the year (where distinct_masked is minimum)
    # Compute the number of consecutive 1s/trues at the previous-end-of-year
    end = boolean_array.loc[boolean_array.index.year == year]
    distinct = end.ne(end.shift()).cumsum() # associate each run of trues/falses with a number
    distinct_masked = distinct[end] # only consider trues from the distinct values, i.e. remove elements of "distinct" that are "False" in the boolean array
    count_distincts = distinct_masked.value_counts() # the index is the distinct run number, the value is the length of that run
    end_consecutive = count_distincts.loc[distinct_masked.max()] # length of the run at the end of the year (where distinct_masked is maximum)
    # Merge/add the consecutives at the boundaries (start-of-year and previous-end-of-year)
    ConsecutiveAtBoundaries = start_consecutive + end_consecutive
    # Now we modify the original MaxConsecutive if ConsecutiveAtBoundaries is larger
    if Modify_MaxConsecutive.loc[year] < ConsecutiveAtBoundaries:
        Modify_MaxConsecutive.loc[year] = ConsecutiveAtBoundaries
# Wanted Output is achieved!
print(Modify_MaxConsecutive)
# Year MaxConsecutive-Ones
# 2000 9
# 2001 3
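For comparison, here is a rough sketch of one possible alternative: label the runs over the whole series first (so nothing is cut at 31 December) and only then group by year. It assumes that a run belongs to the year in which it starts, which is my reading of the wanted output and not necessarily the only sensible rule. The names `values`, `is_one`, `runs`, `per_run` and `max_per_year` are made up for this sketch.

```python
import pandas as pd

# Same example data as in the question.
values = pd.Series(
    [0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1],
    index=pd.date_range("2000-12-22", "2001-01-06"),
)
is_one = values == 1

# Label runs over the whole series, so a run is not split at the year boundary.
run_id = is_one.ne(is_one.shift()).cumsum()

# One row per True element: its run label and its date.
runs = pd.DataFrame({
    "run_id": run_id[is_one].to_numpy(),
    "date": is_one.index[is_one],
})

# Per run: its length and its first day.
per_run = runs.groupby("run_id").agg(
    length=("date", "size"),
    start=("date", "min"),
)

# Assumption: a run is attributed to the year in which it starts.
per_run["year"] = per_run["start"].dt.year

max_per_year = per_run.groupby("year")["length"].max()
print(max_per_year)
# year
# 2000    9
# 2001    3
```

Note that this is not identical to the boundary fix above: there, the leading part of a merged run still counts towards the following year in the groupby step, whereas here the whole run counts only for the year it starts in. On the example data both give 9 and 3. A run spanning more than two calendar years would also be counted whole here, and a year in which no run starts would simply be missing from the result.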