
I have a dataframe like the one given below:

date,value
2/10/19,34
2/11/19,34
2/12/19,34
2/13/19,34
2/14/19,34
2/15/19,34
2/16/19,34
2/17/19,0
2/18/19,0
2/19/19,0
2/20/19,22
2/21/19,22
2/22/19,22
2/23/19,22
2/24/19,0
2/25/19,0
2/26/19,0
2/27/19,0
2/28/19,1
3/1/19,2
3/2/19,2
3/3/19,1
3/4/19,0
3/5/19,0
3/6/19,0
3/7/19,3
3/8/19,3
3/9/19,3
3/10/19,0

After every interval the dataframe has zero values. I want to group rows in such a way that if zero appears more than two times consecutively, each such run becomes a sub-dataframe that is saved to its own file.

Output:


df1 
    2/17/19,0
    2/18/19,0
    2/19/19,0
df2
    2/24/19,0
    2/25/19,0
    2/26/19,0
    2/27/19,0
df3
    3/4/19,0
    3/5/19,0
    3/6/19,0

I have tried many ways to do this, but they all fail.

Thank you.

jax

2 Answers


You can try using rolling:

import pandas as pd

def merge_intervals(intervals):
    # Merge overlapping [start, end] intervals, processed in order of start
    sorted_intervals = sorted(intervals, key=lambda x: x[0])
    interval_index = 0
    for i in sorted_intervals:
        if i[0] > sorted_intervals[interval_index][1]:
            # No overlap: move on to a new current interval
            interval_index += 1
            sorted_intervals[interval_index] = i
        else:
            # Overlap: extend the current interval's end
            sorted_intervals[interval_index] = (sorted_intervals[interval_index][0], i[1])
    return sorted_intervals[:interval_index + 1]

# Index of the last row of every window of 3 consecutive zeros
end_ids = df[df['value'].rolling(3).apply(lambda x: (x == 0).all()) == 1].index

# A window ending at i covers rows i-2..i; store i-3 so the slice below is right
start_ids = end_ids - 3

# Runs longer than 3 zeros produce overlapping windows, so merge them
intervals = merge_intervals([*zip(start_ids, end_ids)])

for i, interval in enumerate(intervals):
    df[interval[0] + 1:interval[1] + 1].to_csv('df_' + str(i) + '.csv')

Not the prettiest code, but it works. The merge function was found here: Merging Overlapping Intervals in Python
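
For reference, here is a minimal sketch of how to load the question's sample data into df so the snippet above runs as-is (assuming the dates can stay as plain strings):

import io
import pandas as pd

# Sample data copied verbatim from the question
data = """date,value
2/10/19,34
2/11/19,34
2/12/19,34
2/13/19,34
2/14/19,34
2/15/19,34
2/16/19,34
2/17/19,0
2/18/19,0
2/19/19,0
2/20/19,22
2/21/19,22
2/22/19,22
2/23/19,22
2/24/19,0
2/25/19,0
2/26/19,0
2/27/19,0
2/28/19,1
3/1/19,2
3/2/19,2
3/3/19,1
3/4/19,0
3/5/19,0
3/6/19,0
3/7/19,3
3/8/19,3
3/9/19,3
3/10/19,0
"""
df = pd.read_csv(io.StringIO(data))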

Bruno Mello
  • I think the code is close, but if I understand correctly, "rolling(3)" is creating some trouble: it generates dataframes with only 3 rows, and different dataframes share common rows. I have to understand this code and modify it accordingly. It's helpful though, thank you. – jax Mar 26 '20 at 15:19
  • Sorry, I didn't see you wanted >=3; I fixed the code to work that way. – Bruno Mello Mar 26 '20 at 15:23
  • Hey, this is what I was trying to do for a couple of hours. Thank you, much appreciated. – jax Mar 26 '20 at 15:30

Find where values are equal to zero and take a rolling sum of length 3. Then find where the rolling sums are equal to 3. The result is lagged by 2 positions, so we take the logical OR of the result with its -1-shifted and -2-shifted versions.

# True on the last row of every window of three consecutive zeros
mask = df['value'].eq(0).rolling(3).sum().eq(3)
# Also flag the two earlier rows of each window so whole runs are selected
mask |= mask.shift(-2) | mask.shift(-1)
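
As a quick sanity check, here is what the mask selects on the question's sample data (a sketch, assuming df holds that data); the lone zero on 3/10/19 is correctly excluded because it is not part of a run of three or more:

print(df.loc[mask, 'date'].tolist())
# ['2/17/19', '2/18/19', '2/19/19',
#  '2/24/19', '2/25/19', '2/26/19', '2/27/19',
#  '3/4/19', '3/5/19', '3/6/19']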

In order to get groups, I take the cumulative sum of the logical negation of the mask. That increments on each unmasked row and stays flat across the zeros, so each run of zeros ends up with its own constant value. By the time I use groupby, that won't matter, because I will have already used the initial mask to select only the rows that satisfied the condition in the first place.

However, the resulting groups will be a non-contiguous set of integers. Because I don't like that, I used factorize to give these groups unique integer values starting from zero.

# Cumsum of the negation is constant across each zero run; factorize
# relabels those constants as consecutive integers 0, 1, 2, ...
grp_masked = (~mask).cumsum()[mask].factorize()[0]
g = df[mask].groupby(grp_masked)

Save files

for grp, d in g:
    d.to_csv(f'df_{grp}.csv', index=False)

Create a dictionary

df_dict = {grp: d for grp, d in g}
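
For example, with the sample data, group 1 holds the four-day run of zeros (a sketch of the expected contents, assuming df was loaded from the question's data):

print(df_dict[1])
#        date  value
# 14  2/24/19      0
# 15  2/25/19      0
# 16  2/26/19      0
# 17  2/27/19      0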

Details

This shows the dataframe along with additional columns that illustrate what we calculated. Note that in this demonstration, rows 1 and 2 also hold zeros: a run of only two, which correctly stays unmasked.

# Group labels aligned to the masked rows, as a nullable Int64 series
group_series = pd.Series(
    grp_masked, df.index[mask], pd.Int64Dtype()
)

df_ = df.assign(
    EqZero=df['value'].eq(0),
    Roll3=df['value'].eq(0).rolling(3).sum(),
    Is3=df['value'].eq(0).rolling(3).sum().eq(3),
    Shift=lambda d: d.Is3.shift(-2) | d.Is3.shift(-1),
    Mask=mask,
    PreGrp=(~mask).cumsum(),
    Grp=group_series
)

df_

       date  value  EqZero  Roll3    Is3  Shift   Mask  PreGrp   Grp
0   2/10/19     34   False    NaN  False  False  False       1  <NA>
1   2/11/19      0    True    NaN  False  False  False       2  <NA>
2   2/12/19      0    True    2.0  False  False  False       3  <NA>
3   2/13/19     34   False    2.0  False  False  False       4  <NA>
4   2/14/19     34   False    1.0  False  False  False       5  <NA>
5   2/15/19     34   False    0.0  False  False  False       6  <NA>
6   2/16/19     34   False    0.0  False  False  False       7  <NA>
7   2/17/19      0    True    1.0  False   True   True       7     0
8   2/18/19      0    True    2.0  False   True   True       7     0
9   2/19/19      0    True    3.0   True  False   True       7     0
10  2/20/19     22   False    2.0  False  False  False       8  <NA>
11  2/21/19     22   False    1.0  False  False  False       9  <NA>
12  2/22/19     22   False    0.0  False  False  False      10  <NA>
13  2/23/19     22   False    0.0  False  False  False      11  <NA>
14  2/24/19      0    True    1.0  False   True   True      11     1
15  2/25/19      0    True    2.0  False   True   True      11     1
16  2/26/19      0    True    3.0   True   True   True      11     1
17  2/27/19      0    True    3.0   True  False   True      11     1
18  2/28/19      1   False    2.0  False  False  False      12  <NA>
19   3/1/19      2   False    1.0  False  False  False      13  <NA>
20   3/2/19      2   False    0.0  False  False  False      14  <NA>
21   3/3/19      1   False    0.0  False  False  False      15  <NA>
22   3/4/19      0    True    1.0  False   True   True      15     2
23   3/5/19      0    True    2.0  False   True   True      15     2
24   3/6/19      0    True    3.0   True  False   True      15     2
25   3/7/19      3   False    2.0  False  False  False      16  <NA>
26   3/8/19      3   False    1.0  False  False  False      17  <NA>
27   3/9/19      3   False    0.0  False  False  False      18  <NA>
28  3/10/19      0    True    1.0  False  False  False      19  <NA>
piRSquared