2

I have data for many countries over a period of time (2001-2003). It looks something like this:

index year country inflation GDP
1 2001 AFG nan 48
2 2002 AFG nan 49
3 2003 AFG nan 50
4 2001 CHI 3.0 nan
5 2002 CHI 5.0 nan
6 2003 CHI 7.0 nan
7 2001 USA nan 220
8 2002 USA 4.0 250
9 2003 USA 2.5 280

I want to drop countries in case there is no data (i.e. values are missing for all years) for any given variable.

In the example table above, I want to drop AFG (because it misses all values for inflation) and CHI (GDP missing). I don't want to drop observation #7 just because one year is missing.

What's the best way to do that?

3 Answers3

1

This should work by filtering all values that have nan in one of (inflation, GDP):

(
    df.groupby(['country'])
    .filter(lambda x: not x['inflation'].isnull().all() and not x['GDP'].isnull().all())
)

Note, if you have more than two columns you can work on a more general version of this:

df.groupby(['country']).filter(lambda x: not x.isnull().all().any())

If you want this to work with a specific range of year instead of all columns, you can set up a mask and change the code a bit:

mask = (df['year'] >= 2002) & (df['year'] <= 2003) # mask of years
grp = df.groupby(['country']).filter(lambda x: not x[mask].isnull().all().any())
Mohammad
  • 3,276
  • 2
  • 19
  • 35
  • This worked! Thank you so much!! Is it possible to manipulate this for a range of years? If my total range is 1980-2000, and I want to drop if x.isnull() for all in the range (say) 1980-1990. – Joao Pedro Bastos Jan 14 '22 at 20:20
  • @JoaoPedroBastos yes indeed, you will need to create a mask based on the years you need (this could also be any condition you need) and then apply it to `x` in the lambda expression. I have updated my answer with an example. – Mohammad Jan 14 '22 at 20:47
1

You can also try this:

# check where the sum is equal to 0 - means no values in the column for a specific country
group_by = df.groupby(['country']).agg({'inflation':sum, 'GDP':sum}).reset_index()

# extract only countries with information on both columns
indexes = group_by[ (group_by['GDP'] != 0) & ( group_by['inflation'] != 0) ].index
final_countries = list(group_by.loc[ group_by.index.isin(indexes), : ]['country'])

# keep the rows contains the countries

df = df.drop(df[~df.country.isin(final_countries)].index)
Tomer S
  • 900
  • 1
  • 10
  • 21
0

You could reshape the data frame from long to wide, drop nulls, and then convert back to wide.

To convert from long to wide, you can use pivot functions. See this question too.

Here's code for dropping nulls, after its reshaped:

df.dropna(axis=0, how= 'any', thresh=None, subset=None, inplace=True) # Delete rows, where any value is null

To convert back to long, you can use pd.melt.

Kat Cow
  • 24
  • 3