How to convert monthly data to NaN based on a condition?

Question

I have this df:

       CODE      DATE     TMAX  TMIN   PP
0      000130 1991-01-01  32.6  23.4  0.0
1      000130 1991-01-02  31.2  22.4  0.0
2      000130 1991-01-03  32.0   NaN  0.0
3      000130 1991-01-04  32.2  23.0  0.0
4      000130 1991-01-05  30.5  22.0  0.0
      ...        ...   ...   ...  ...
10865  000130 2020-12-31   NaN   NaN  NaN
10866  000132 1991-01-01  35.2   NaN  0.0
10867  000132 1991-01-02  34.6   NaN  0.0
10868  000132 1991-01-03  35.8   NaN  0.0
10869  000132 1991-01-04  34.8   NaN  0.0

I want to convert monthly data to NaN only if there are 5 or more consecutive NaN values in a month.

For example: if January 1991 has 5 or more consecutive NaN values in column TMAX, all January 1991 values of column TMAX must be converted to NaN. The same applies to every month in every year. I need to do this by CODE (every CODE value has TMAX data for January 1991, February 1991, ... December 2020), so I'm thinking of using df.groupby('CODE') first. There are 371 codes.

For the PP column I need to convert monthly data to NaN only if there are 3 or more NON-consecutive NaN values in a month. For example: if January 1991 has 3 NON-consecutive NaN values in column PP, all January 1991 values of column TMAX must be converted to NaN. The same applies to every month in every year. I also need to do this by CODE.

I'm a beginner in Python, so I would appreciate any help.

Thanks in advance.

score 1 · Accepted Answer · answered Mar 23 '21 at 20:12

Identifying consecutive NaNs:
- a reasonable amount of code is needed to generate a test data set
- define a function to be called in transform()
- groupby() on CODE, year and month, calling that function
- n is a parameter, so the function can be used for different lengths of consecutive runs
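
Before the full example, the run-counting idea used inside the function below can be seen on a small Series. This is only a sketch on made-up values (the toy Series `s` and the variable names are hypothetical): the cumulative count of non-null values stays flat across a run of NaNs, so it can serve as a label for each run.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: one run of 2 NaNs and one run of 3 NaNs.
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, np.nan, np.nan, 8.0])

# The cumulative count of non-null values does not change across a run of NaNs,
# so each run shares a label with the non-null value that precedes it.
run_id = s.notnull().cumsum()

# Summing the null flags per label gives the length of each NaN run.
run_lengths = s.isnull().astype(int).groupby(run_id).sum()

longest_run = int(run_lengths.max())
has_run_of_3 = bool(run_lengths.ge(3).any())
has_run_of_5 = bool(run_lengths.ge(5).any())
```

With a threshold of 5 this toy series would be left untouched, while a threshold of 3 would flag it.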

import numpy as np
import pandas as pd

# construct a CODE / DATE dataframe (cross join via a temporary key column "foo")
df = (pd.DataFrame([f"{i:08d}" for i in range(3)], columns=["CODE"]).assign(foo=1)
 .merge(pd.DataFrame(pd.date_range("01-Jan-1991", "01-Jan-1993"), columns=["DATE"]).assign(foo=1), on="foo")
 .drop(columns="foo")
)


# add a column that has NaNs in it, some of which will be consecutive
A = np.random.uniform(20,30, len(df))
A.ravel()[np.random.choice(A.size, A.size//3, replace=False)] = np.nan
df["TMIN"] = A

# function that returns NaN for the whole group if it contains n or more consecutive NaNs
def consecutivenan(d, n=5):
    if d.isnull().astype(int).groupby(d.notnull().astype(int).cumsum()).sum().ge(n).any():
        return np.nan 
    else:
        return d

df["TMIN_C"] = df.groupby(["CODE", df.DATE.dt.year, df.DATE.dt.month], as_index=False)["TMIN"].transform(consecutivenan, n=5)

# demo that it worked: find the first row whose TMIN value was blanked in TMIN_C
i = df.loc[~df.TMIN.isna() & df.TMIN.ne(df.TMIN_C)].index[0]
df.loc[i-2:i+28]

output

Rows 284 to 288 are 5 consecutive NaN values, so the whole of October 1991 has been set to NaN.

	DATE	TMIN	TMIN_C
271	1991-09-29 00:00:00	28.1745	28.1745
272	1991-09-30 00:00:00	21.1691	21.1691
273	1991-10-01 00:00:00	28.7848	nan
274	1991-10-02 00:00:00	22.2346	nan
275	1991-10-03 00:00:00	22.306	nan
276	1991-10-04 00:00:00	21.5774	nan
277	1991-10-05 00:00:00	23.8348	nan
278	1991-10-06 00:00:00	21.4416	nan
279	1991-10-07 00:00:00	nan	nan
280	1991-10-08 00:00:00	nan	nan
281	1991-10-09 00:00:00	nan	nan
282	1991-10-10 00:00:00	27.3833	nan
283	1991-10-11 00:00:00	27.2125	nan
284	1991-10-12 00:00:00	nan	nan
285	1991-10-13 00:00:00	nan	nan
286	1991-10-14 00:00:00	nan	nan
287	1991-10-15 00:00:00	nan	nan
288	1991-10-16 00:00:00	nan	nan
289	1991-10-17 00:00:00	24.8782	nan
290	1991-10-18 00:00:00	29.7879	nan
291	1991-10-19 00:00:00	27.0532	nan
292	1991-10-20 00:00:00	21.3854	nan
293	1991-10-21 00:00:00	nan	nan
294	1991-10-22 00:00:00	nan	nan
295	1991-10-23 00:00:00	23.6399	nan
296	1991-10-24 00:00:00	nan	nan
297	1991-10-25 00:00:00	24.7265	nan
298	1991-10-26 00:00:00	20.2296	nan
299	1991-10-27 00:00:00	22.0885	nan
300	1991-10-28 00:00:00	27.3212	nan
301	1991-10-29 00:00:00	nan	nan
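
The PP rule from the question is not covered by the code above. A minimal sketch follows, assuming "3 or more non-consecutive NaN values" means 3 or more NaNs in total within the month, whether or not they are adjacent. The toy frame `toy`, the output column `PP_C` and the threshold name `n` are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one CODE, two months of daily PP values.
dates = pd.date_range("1991-01-01", "1991-02-28", freq="D")
toy = pd.DataFrame({"CODE": "000130", "DATE": dates, "PP": 0.0})

# January gets 3 scattered NaNs, February gets only 2.
jan_gaps = pd.to_datetime(["1991-01-03", "1991-01-10", "1991-01-20"])
feb_gaps = pd.to_datetime(["1991-02-05", "1991-02-15"])
toy.loc[toy["DATE"].isin(jan_gaps), "PP"] = np.nan
toy.loc[toy["DATE"].isin(feb_gaps), "PP"] = np.nan

# Group keys: station code, year and month.
keys = [
    toy["CODE"],
    toy["DATE"].dt.year.rename("year"),
    toy["DATE"].dt.month.rename("month"),
]

# Assumed rule: blank the whole month when it holds n or more NaNs in total.
n = 3
nan_count = toy["PP"].isna().astype(int).groupby(keys).transform("sum")
toy["PP_C"] = toy["PP"].mask(nan_count >= n)
```

Here January ends up entirely NaN in PP_C, while February keeps its values apart from its own 2 gaps. If the intended rule is different (for example, counting only NaNs that are not adjacent to another NaN), the `nan_count` line is the part to change.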
