
In my previous question, I was trying to count blanks and build a DataFrame with new columns for subsequent analysis. That question became too long, so I decided to split it into separate questions.

I have my initial dataset:

import pandas as pd
import numpy as np

df = pd.DataFrame({'id': [1000, 2000, 3000, 4000],
                   '201710': [7585, 4110, 4498, np.nan],
                   '201711': [7370, 3877, 4850, 4309],
                   '201712': [6505, np.nan, 4546, 4498],
                   '201801': [7473, np.nan, np.nan, 4850],
                   '201802': [6183, np.nan, np.nan, np.nan],
                   '201803': [6699, 4558, 1429, np.nan],
                   '201804': [118, 4152, 1429, np.nan],
                   '201805': [np.nan, 4271, 1960, np.nan],
                   '201806': [np.nan, np.nan, 1798, np.nan],
                   '201807': [np.nan, np.nan, 1612, 4361],
                   '201808': [np.nan, np.nan, 1612, 4272],
                   '201809': [np.nan, 3900, 1681, 4199]})

I need to obtain the start and end dates of each fraction (run of non-blank values) for each id. I managed to get the first start date and the last end date, but not the ones in the middle. I also counted the blanks in each gap (for further analysis).

The code is here (it might look confusing):

# to obtain the first and last occurrence with data
res = pd.melt(df, id_vars=['id'], value_vars=df.columns[1:])
res.dropna(subset=['value'], inplace=True)
res.sort_values(by=['id', 'variable', 'value'], ascending=[True, True, True], inplace=True)

minimum_date = res.drop_duplicates(subset=['id'], keep='first')
maximum_date = res.drop_duplicates(subset=['id'], keep='last')

minimum_date.rename(columns={'variable': 'start_date'}, inplace=True)
maximum_date.rename(columns={'variable': 'end_date'}, inplace=True)


# To obtain number of gaps (nulls) and their length
res2 = pd.melt(df, id_vars=['id'], value_vars=df.columns[1:])
res2.sort_values(by=['id', 'variable'], ascending=[True, True], inplace=True)
res2 = res2.replace(np.nan, 0)
m = res2.value.diff().ne(0).cumsum().rename('gid')
gaps = res2.groupby(['id', m]).value.value_counts().loc[:, :, 0].droplevel(-1).reset_index()
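The run-labelling trick used here (`diff().ne(0).cumsum()`) can be seen on a minimal single-id series; this is a sketch with made-up values, not the question's data:

```python
import pandas as pd
import numpy as np

# NaNs replaced by 0, as in the question's code
s = pd.Series([7.0, np.nan, np.nan, 3.0, np.nan, 5.0]).fillna(0)

# consecutive equal values share one group id
gid = s.diff().ne(0).cumsum()

# count zeros (former NaNs) inside each group, keep non-empty gaps
gap_sizes = s.eq(0).groupby(gid).sum()
gap_sizes = gap_sizes[gap_sizes > 0]
print(gap_sizes.tolist())  # [2, 1]
```

Note that identical consecutive non-null values would also collapse into one group; this does not affect the zero counts, but grouping on the NaN mask itself avoids the issue entirely.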


# add columns to main dataset with start- and end dates and gaps
df = pd.merge(df, minimum_date[['id', 'start_date']], on=['id'], how='left')
df = pd.merge(df, maximum_date[['id', 'end_date']], on=['id'], how='left')

I came to a dataset like this, where start_date is the first non-null occurrence, end_date is the last non-null occurrence, and the 1-, 2-, 3- blank columns hold the blank counts of each gap for further analysis:

[screenshot of the intermediate dataframe]

The output is intended to have additional columns:

[screenshot of the intended output]

Vero
    Could you please post your samples in form of text in your question, as images on SO are not that encouraged. – RavinderSingh13 Sep 14 '20 at 14:10
  • @RavinderSingh13, I was trying to do so, but it's too much data... getting confused. If you run the script in the question you will get the input; the output has additional columns that are the start and end dates of each fraction with data (not blanks) – Vero Sep 14 '20 at 14:24
  • 1
    Please post a code that you tried to run & explain better what are you trying to do – gtomer Sep 14 '20 at 14:30
  • @gtomer, I have edited the question. I came to this point and I do not know how to solve it further – Vero Sep 14 '20 at 15:28

1 Answer


Here is a function that may be helpful, IIUC.

import pandas as pd

# create test data
t = pd.DataFrame({'x': [10, 20] + [None] * 3 + [30, 40, 50, 60] + [None] * 5 + [70]})

Create a function to find start location, end location, and size of each 'group', where a group is a sequence of repeated values (e.g., NaNs):

def extract_nans(df, field):
    df = df.copy()
    
    # identify NaNs
    df['is_na'] = df[field].isna()

    # identify groups (sequence of identical values is a group):  X Y X => 3 groups
    df['group_id'] = df['is_na'].ne(df['is_na'].shift(fill_value=df['is_na'].iloc[0])).cumsum()

    # how many members in this group?
    df['group_size'] = df.groupby('group_id')['group_id'].transform('size')

    # initial, final index of each group
    df['min_index'] = df.reset_index().groupby('group_id')['index'].transform('min')
    df['max_index'] = df.reset_index().groupby('group_id')['index'].transform('max')

    return df

Results:

summary = extract_nans(t, 'x')
print(summary)

       x  is_na  group_id  group_size  min_index  max_index
0   10.0  False         0           2          0          1
1   20.0  False         0           2          0          1
2    NaN   True         1           3          2          4
3    NaN   True         1           3          2          4
4    NaN   True         1           3          2          4
5   30.0  False         2           4          5          8
6   40.0  False         2           4          5          8
7   50.0  False         2           4          5          8
8   60.0  False         2           4          5          8
9    NaN   True         3           5          9         13
10   NaN   True         3           5          9         13
11   NaN   True         3           5          9         13
12   NaN   True         3           5          9         13
13   NaN   True         3           5          9         13
14  70.0  False         4           1         14         14

Now you can exclude 'x' from the summary, drop duplicates, keep only the NaN rows (is_na == True), filter on run length (e.g., at least 3 consecutive NaN values), and so on. After dropping duplicates, the first row summarizes the first NaN run, the second row the second run, etc.
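Concretely, that filtering might look like this (a self-contained sketch that inlines the same grouping logic and keeps only NaN runs of length >= 3):

```python
import pandas as pd

# same test data as above
t = pd.DataFrame({'x': [10, 20] + [None] * 3 + [30, 40, 50, 60] + [None] * 5 + [70]})

df = t.copy()
df['is_na'] = df['x'].isna()
# label runs: each change between NaN and non-NaN starts a new group
df['group_id'] = df['is_na'].ne(df['is_na'].shift(fill_value=df['is_na'].iloc[0])).cumsum()
df['group_size'] = df.groupby('group_id')['group_id'].transform('size')
df['min_index'] = df.reset_index().groupby('group_id')['index'].transform('min')
df['max_index'] = df.reset_index().groupby('group_id')['index'].transform('max')

# keep one summary row per NaN run of length >= 3
runs = (df[df['is_na'] & (df['group_size'] >= 3)]
        .drop(columns='x')
        .drop_duplicates())
print(runs)
```

For the test data this yields one row per qualifying run: the run of 3 NaNs at indices 2-4 and the run of 5 NaNs at indices 9-13.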

Finally, you can use this with apply() to process the whole data frame, if this is what you need.
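One way to sketch that per-id processing (shown here on a hypothetical smaller frame, using a plain groupby loop rather than apply) is to melt to long form first and then label NaN runs within each id:

```python
import pandas as pd
import numpy as np

# hypothetical smaller frame in the same wide shape as the question's data
df = pd.DataFrame({'id': [1000, 2000],
                   '201710': [7585, np.nan],
                   '201711': [np.nan, 3877],
                   '201712': [np.nan, np.nan],
                   '201801': [7473, 4309]})

long = df.melt(id_vars='id', var_name='date').sort_values(['id', 'date'])

def nan_runs(g):
    # label runs of NaN / non-NaN within one id
    is_na = g['value'].isna()
    gid = is_na.ne(is_na.shift(fill_value=is_na.iloc[0])).cumsum()
    # start date, end date and length of each NaN run
    return (g[is_na].groupby(gid[is_na])
            .agg(start=('date', 'first'), end=('date', 'last'),
                 length=('date', 'size'))
            .reset_index(drop=True))

gaps = pd.concat({i: nan_runs(g) for i, g in long.groupby('id')}, names=['id'])
print(gaps)
```

For this toy frame, id 1000 has one gap (201711-201712, length 2) and id 2000 has two gaps of length 1 each.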

Short version of results, for the test data frame:

print(summary[summary['is_na']].drop(columns='x').drop_duplicates())
   is_na  group_id  group_size  min_index  max_index
2   True         1           3          2          4
9   True         3           5          9         13
jsmart
  • Thank you very much. This looks like what I need. I'll take a closer look at the code to understand it better. I love to learn new stuff about Python with you guys!!! – Vero Sep 14 '20 at 17:05
  • @jsmart, thanks a lot! Your example is very helpful! – Vero Sep 15 '20 at 09:46