How to iterate over columns and check condition by group

Question

I have data for many countries over a period of time (2001-2003). It looks something like this:

index	year	country	inflation	GDP
1	2001	AFG	nan	48
2	2002	AFG	nan	49
3	2003	AFG	nan	50
4	2001	CHI	3.0	nan
5	2002	CHI	5.0	nan
6	2003	CHI	7.0	nan
7	2001	USA	nan	220
8	2002	USA	4.0	250
9	2003	USA	2.5	280

I want to drop countries in case there is no data (i.e. values are missing for all years) for any given variable.

In the example table above, I want to drop AFG (because it misses all values for inflation) and CHI (GDP missing). I don't want to drop observation #7 just because one year is missing.

What's the best way to do that?

can you give an example of before and expected output? – Tomer S Jan 14 '22 at 18:53 — Tomer S, Jan 14 '22 at 18:53
Hey, sure. I have edited the question to be clearer. – Joao Pedro Bastos Jan 14 '22 at 19:11 — Joao Pedro Bastos, Jan 14 '22 at 19:11

Mohammad · Accepted Answer · 2022-01-14T20:46:56.173

1

This should work by filtering all values that have nan in one of (inflation, GDP):

(
    df.groupby(['country'])
    .filter(lambda x: not x['inflation'].isnull().all() and not x['GDP'].isnull().all())
)

Note, if you have more than two columns you can work on a more general version of this:

df.groupby(['country']).filter(lambda x: not x.isnull().all().any())

If you want this to work with a specific range of year instead of all columns, you can set up a mask and change the code a bit:

mask = (df['year'] >= 2002) & (df['year'] <= 2003) # mask of years
grp = df.groupby(['country']).filter(lambda x: not x[mask].isnull().all().any())

edited Jan 14 '22 at 20:46

answered Jan 14 '22 at 19:30

Mohammad

3,276
2
19
35

This worked! Thank you so much!! Is it possible to manipulate this for a range of years? If my total range is 1980-2000, and I want to drop if x.isnull() for all in the range (say) 1980-1990. – Joao Pedro Bastos Jan 14 '22 at 20:20
@JoaoPedroBastos yes indeed, you will need to create a mask based on the years you need (this could also be any condition you need) and then apply it to `x` in the lambda expression. I have updated my answer with an example. – Mohammad Jan 14 '22 at 20:47

score 1 · Answer 2 · answered Jan 14 '22 at 19:35

You can also try this:

# check where the sum is equal to 0 - means no values in the column for a specific country
group_by = df.groupby(['country']).agg({'inflation':sum, 'GDP':sum}).reset_index()

# extract only countries with information on both columns
indexes = group_by[ (group_by['GDP'] != 0) & ( group_by['inflation'] != 0) ].index
final_countries = list(group_by.loc[ group_by.index.isin(indexes), : ]['country'])

# keep the rows contains the countries

df = df.drop(df[~df.country.isin(final_countries)].index)

score 0 · Answer 3 · answered Jan 14 '22 at 19:36

You could reshape the data frame from long to wide, drop nulls, and then convert back to wide.

To convert from long to wide, you can use pivot functions. See this question too.

Here's code for dropping nulls, after its reshaped:

df.dropna(axis=0, how= 'any', thresh=None, subset=None, inplace=True) # Delete rows, where any value is null

To convert back to long, you can use pd.melt.

How to iterate over columns and check condition by group

3 Answers3