
I've attached a picture of my Series and the code to obtain the series. How would I obtain the number of days between a 1 and the next 0? For example, the number of days between the first 1 and the next 0 is 4 days (1st August to 5th August), and the number of days between the next 1 and 0 is also 4 days (8th August to 12th August).

import pandas as pd

values = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1]
      
dates =['2019-08-01', '2019-08-02', '2019-08-05', '2019-08-06',
           '2019-08-07', '2019-08-08', '2019-08-09', '2019-08-12',
           '2019-08-13', '2019-08-14', '2019-08-15', '2019-08-16',
           '2019-08-19', '2019-08-20', '2019-08-21', '2019-08-22',
           '2019-08-23', '2019-08-26', '2019-08-27', '2019-08-28',
           '2019-08-29', '2019-08-30']

pd.Series(values, index = dates)
Ch3steR
    welcome to SO! Please paste the data directly in the question and format as code. Makes easier to copy and replicate – Vivek Kalyanarangan Nov 15 '20 at 11:26
  • This is not your question, but please take a look at the accepted answer for https://stackoverflow.com/questions/57431667/pandas-fill-one-column-with-count-of-of-obs-between-occurrences-in-a-2nd-colu . (It counts how many until a new occurrence.) – rajah9 Nov 15 '20 at 11:57

3 Answers


You can try this using groupby, similar to itertools.groupby, and then extract the first index of every group. Since you have to find the difference between pairs of groups, there has to be the same number of 1-groups and 0-groups; if that's not the case, drop the last group.

import numpy as np
import pandas as pd

s = pd.Series(values, index=pd.to_datetime(dates))
g = s.ne(s.shift()).cumsum()
vals = s.groupby(g).apply(lambda x:x.index[0])
# vals
1    2019-08-01
2    2019-08-05
3    2019-08-08
4    2019-08-12
5    2019-08-13
6    2019-08-14
7    2019-08-16
8    2019-08-23
9    2019-08-29
dtype: object

Here we don't have the same number of 1-groups and 0-groups, so ditch the last group's index. Then reshape into chunks of size 2, so that each row holds the start indices of a 1-group and its following 0-group.

end = None if not len(vals)%2 else -1
vals = vals.iloc[:end].values.reshape((-1, 2))
# vals 
array([['2019-08-01T00:00:00.000000000', '2019-08-05T00:00:00.000000000'],
       ['2019-08-08T00:00:00.000000000', '2019-08-12T00:00:00.000000000'],
       ['2019-08-13T00:00:00.000000000', '2019-08-14T00:00:00.000000000'],
       ['2019-08-16T00:00:00.000000000', '2019-08-23T00:00:00.000000000']],
      dtype='datetime64[ns]')

Now, we have to find the difference using np.diff.

days = np.diff(vals, axis=1).squeeze()
out = pd.Series(days)
# out

0   4 days
1   4 days
2   1 days
3   7 days
dtype: timedelta64[ns]
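For reference, the steps above combined into one runnable snippet (assuming only pandas and numpy):

```python
import numpy as np
import pandas as pd

values = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1]
dates = ['2019-08-01', '2019-08-02', '2019-08-05', '2019-08-06',
         '2019-08-07', '2019-08-08', '2019-08-09', '2019-08-12',
         '2019-08-13', '2019-08-14', '2019-08-15', '2019-08-16',
         '2019-08-19', '2019-08-20', '2019-08-21', '2019-08-22',
         '2019-08-23', '2019-08-26', '2019-08-27', '2019-08-28',
         '2019-08-29', '2019-08-30']

s = pd.Series(values, index=pd.to_datetime(dates))
g = s.ne(s.shift()).cumsum()                     # label consecutive runs
vals = s.groupby(g).apply(lambda x: x.index[0])  # first date of each run
end = None if not len(vals) % 2 else -1          # drop a trailing unmatched run
pairs = pd.to_datetime(vals.iloc[:end]).values.reshape((-1, 2))
out = pd.Series(np.diff(pairs, axis=1).squeeze())
print(out)  # 4 days, 4 days, 1 days, 7 days
```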
Ch3steR

I think something like the code below should work. First, build a series with a date index:

ds = pd.Series(values, index = pd.to_datetime(dates))

Then you calculate the difference between consecutive values:

delta = ds - ds.shift(fill_value=ds.iloc[0] - 1)

It looks like this:

pd.DataFrame({'value':ds,'delta':delta})


    value   delta
2019-08-01  1   1
2019-08-02  1   0
2019-08-05  0   -1
2019-08-06  0   0
2019-08-07  0   0
2019-08-08  1   1
2019-08-09  1   0
2019-08-12  0   -1
2019-08-13  1   1
2019-08-14  0   -1

So the start dates you need are where delta is 1, and the next zeros you need are where it is -1. So:

starts = ds.index[delta == 1]
ends = ds.index[delta == -1]
(ends - starts[:len(ends)]).days

Int64Index([4, 4, 1, 7], dtype='int64')

Note that in some cases, at the end of the data frame, you have 1s that never flip back into a 0, so I ignore those.
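Put together, a minimal runnable version of this answer (using `ds.iloc[0]` rather than positional `ds[0]`, which newer pandas deprecates on a non-integer index):

```python
import pandas as pd

values = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1]
dates = ['2019-08-01', '2019-08-02', '2019-08-05', '2019-08-06',
         '2019-08-07', '2019-08-08', '2019-08-09', '2019-08-12',
         '2019-08-13', '2019-08-14', '2019-08-15', '2019-08-16',
         '2019-08-19', '2019-08-20', '2019-08-21', '2019-08-22',
         '2019-08-23', '2019-08-26', '2019-08-27', '2019-08-28',
         '2019-08-29', '2019-08-30']

ds = pd.Series(values, index=pd.to_datetime(dates))
delta = ds - ds.shift(fill_value=ds.iloc[0] - 1)  # first row counts as a start
starts = ds.index[delta == 1]     # dates where the series flips to 1
ends = ds.index[delta == -1]      # dates where it flips back to 0
gaps = (ends - starts[:len(ends)]).days
print(list(gaps))  # [4, 4, 1, 7]
```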

StupidWolf
  • Thanks, it worked. Now I have to do the same for another dataset, but where the indexes are integers. Do you have any idea how to do it? – VanillaChoco441 Nov 15 '20 at 13:59
  • when you create the index, you don't add ```pd.to_datetime()``` ? It should work if your indexes are not duplicated.. – StupidWolf Nov 15 '20 at 14:00
  • I get an error about the fill value needing to be scalar when the indexes are integers – VanillaChoco441 Nov 15 '20 at 14:05
  • hi @VanillaChoco441, I cannot help you if I don't see the data. I tried my code above with a index that are integers and it worked. And I suspect you have to construct the pd series correctly. Also since you are new to SO, this is a bit out of the scope of your question – StupidWolf Nov 15 '20 at 14:11
  • I hope you would be fair to users who did their best to answer your question, and not move the goal post by asking more questions which are out of the scope... or pertaining to another dataset in this case – StupidWolf Nov 15 '20 at 14:12
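Regarding the integer-index follow-up: the asker's second dataset isn't shown, so the data below is made up, but the same flip-detection logic works with a plain RangeIndex (the subtraction then yields positional distances rather than days):

```python
import pandas as pd

# Hypothetical integer-indexed data, standing in for the asker's second dataset
values = [1, 1, 0, 0, 1, 0]
ds = pd.Series(values)                            # default integer index 0..5
delta = ds - ds.shift(fill_value=ds.iloc[0] - 1)  # ds.iloc[0] keeps the fill value scalar
starts = ds.index[delta == 1]
ends = ds.index[delta == -1]
gaps = ends - starts[:len(ends)]
print(list(gaps))  # [2, 1]
```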

Start by creating a DataFrame with a date column (the dates converted to datetime) and a val column (the values):

df = pd.DataFrame({'date': pd.to_datetime(dates), 'val': values})

The idea to get the result is:

  • Get dates where val == 0 (for other rows take NaT).
  • Perform "backwards filling".
  • Subtract date.
  • From the above result (timedelta) get the days number.
  • Fill outstanding NaT values (if any) with 0 (in your case this pertains to 2 last rows, which are not followed by any "0 row").
  • Save the result in dist column.

The code to do it is:

df['dist'] = (df.date.where(df.val == 0).bfill() - df.date)\
    .dt.days.fillna(0, downcast='infer')

The result is:

         date  val  dist
0  2019-08-01    1     4
1  2019-08-02    1     3
2  2019-08-05    0     0
3  2019-08-06    0     0
4  2019-08-07    0     0
5  2019-08-08    1     4
6  2019-08-09    1     3
7  2019-08-12    0     0
8  2019-08-13    1     1
9  2019-08-14    0     0
10 2019-08-15    0     0
11 2019-08-16    1     7
12 2019-08-19    1     4
13 2019-08-20    1     3
14 2019-08-21    1     2
15 2019-08-22    1     1
16 2019-08-23    0     0
17 2019-08-26    0     0
18 2019-08-27    0     0
19 2019-08-28    0     0
20 2019-08-29    1     0
21 2019-08-30    1     0

(dist column is the distance in days).

If you need to, take only the rows with val != 0 from the above result.
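A runnable sketch of the whole approach (with `.bfill()` called without the positional axis argument and an explicit cast instead of `downcast`, both of which newer pandas deprecates):

```python
import pandas as pd

values = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1]
dates = ['2019-08-01', '2019-08-02', '2019-08-05', '2019-08-06',
         '2019-08-07', '2019-08-08', '2019-08-09', '2019-08-12',
         '2019-08-13', '2019-08-14', '2019-08-15', '2019-08-16',
         '2019-08-19', '2019-08-20', '2019-08-21', '2019-08-22',
         '2019-08-23', '2019-08-26', '2019-08-27', '2019-08-28',
         '2019-08-29', '2019-08-30']

df = pd.DataFrame({'date': pd.to_datetime(dates), 'val': values})
# Keep the date only on "0" rows, back-fill it, subtract the current date,
# convert to whole days, and zero out trailing rows with no later "0" row.
df['dist'] = ((df.date.where(df.val == 0).bfill() - df.date)
              .dt.days.fillna(0).astype(int))
nonzero = df[df.val != 0]  # only the rows where val != 0, if needed
```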

Valdi_Bo