1

In one of my previous questions I have been given different solutions for the task that helped me a lot. Now I faced with a different kind of problem that involves counting of blanks between periods and attributing begin and start dates based on several conditions.

Example:

Input

enter image description here

I need to attribute start and end dates, taking into consideration the gaps between dates. To get something like that:

Output

enter image description here

There was considered the following conditions in attributing dates:

  1. If the gap (number of blanks) between period is equal or less 2, the period is considered continuous (eg. id 3000). In this case, begin and end dates are as the first and the last date with data.

  2. If the gap > 2, so there is a need to attribute 2 or more periods (can be more than 3), in such a way that each faction gets its start- and end- dates, and if there is data in the last column, then the end-date is infinite.

  3. Also I need to calculate the average in each fraction defined.

What is the simplest way to do it?

To facilitate an example dataset:

import pandas as pd
data = pd.DataFrame({'id':[1000,2000,3000,4000],
               '201710':[7585,  4110,   4498,   np.nan],
               '201711':[7370,  3877,   4850,   4309],
               '201712':[6505,    np.nan,   4546,   4498],
               '201801':[7473,    np.nan,     np.nan,   4850],
               '201802':[6183,    np.nan,     np.nan,   np.nan ],
               '201803':[6699,  4558,   1429,   np.nan ],
               '201804':[ 118,  4152,   1429,   np.nan ],
               '201805':[  np.nan,  4271,   1960,   np.nan ],
               '201806':[  np.nan,    np.nan,   1798,   np.nan ],
               '201807':[  np.nan,    np.nan,   1612,   4361],
               '201808':[  np.nan,    np.nan,   1612,   4272],
               '201809':[  np.nan,  3900,   1681,   4199]                 
               })

What I have done for counting NANs is the following:

res = pd.melt(df, id_vars=['id'], value_vars=df.columns[1:])
res.sort_values(by=['id', 'variable'], ascending=[True, True], inplace=True)
res=res.replace(np.nan, 0) 
m = res.value.diff().ne(0).cumsum().rename('gid') 
df1 = res.groupby(['id', m]).value.value_counts().loc[:,:,0].droplevel(-1)

This solution I found here and like it. However, I'm stuck in the next steps..

As a result I got a dataframe like this, where start_date is the 1st notnull occurrence, end_date - the last notnull occurence and 1-2-3 blanks are fractions with blanks counting for further analysis: enter image description here

you can obtain it via script

df = pd.DataFrame({'id':[1000,2000,3000,4000],
               '201710':[7585,  4110,   4498,   np.nan],
               '201711':[7370,  3877,   4850,   4309],
               '201712':[6505,    np.nan,   4546,   4498],
               '201801':[7473,    np.nan,     np.nan,   4850],
               '201802':[6183,    np.nan,     np.nan,   np.nan ],
               '201803':[6699,  4558,   1429,   np.nan ],
               '201804':[ 118,  4152,   1429,   np.nan ],
               '201805':[  np.nan,  4271,   1960,   np.nan ],
               '201806':[  np.nan,    np.nan,   1798,   np.nan ],
               '201807':[  np.nan,    np.nan,   1612,   4361],
               '201808':[  np.nan,    np.nan,   1612,   4272],
               '201809':[  np.nan,  3900,   1681,   4199] ,
               'start_date':[201710, 201710, 201710,201711],
               'end_date':[201804,201809, 201809,201809],
               '1st_blank':[5, 3,2,5],
               '2nd_blank':[np.nan, 3, np.nan, np.nan]
               })

Now, I need to get start and end dates of each fraction:

enter image description here

Vero
  • 479
  • 5
  • 11

1 Answers1

1

I don't have enough time.So it's one idea

df.set_index("id", inplace=True)
gap = 2
df.fillna("X", inplace=True).astype(str)
def get_series(arr):
    s = "".join(arr)
    res = [(i[0],i[-1]) for i in s.split("X" * gap)]
    # get index of i[0] and i[-1], and get its arr.index and return
    return [index1, index2, index3, index4....]
df['multi_index'] = df.apply(get_series, axis=1)
# do split columns 'multi_index'
Johnny
  • 687
  • 1
  • 6
  • 22
  • It's not completely enforceable – Johnny Sep 11 '20 at 10:49
  • i do not understand very well the idea – Vero Sep 11 '20 at 13:33
  • 1
    @Vero The row as a String, then separated by a blank position(gap*n), and then returns the beginning and end of each delimiter sequence. – Johnny Sep 14 '20 at 01:41
  • yes, i understand now and it makes sense. But code return an error. Can you help me to adjust it to mine? How can I pick a date of the first blank occurrence by row? – Vero Sep 14 '20 at 08:20
  • 1
    @Vero Instead of a string, you can use a sliding window to determine whether an array satisfies N*Gap, if true split array. [1,2,3,G,G,G,5,6,7] => [1,2,3] [5,6,7], then return [1.id,3.id], [5.id,7.id] – Johnny Sep 14 '20 at 10:20
  • thank you. but how can I pick the date to get a dataframe with the columns as above (I edited the question and added a frame I got up to now. Thank you for any help – Vero Sep 14 '20 at 12:45
  • I have created new question regarding this. If you want to respond here is the link https://stackoverflow.com/questions/63886007/pandas-picking-a-column-name-based-on-row-data. – Vero Sep 14 '20 at 14:10