Pandas indexing and calculating over rows efficiency ideas vectorizing code

Question

I am somewhat new to pandas and feel like there should be a more effective way to get the difference of the min year and max year, so change, for each country without iterating over each country like I am doing. I would like to vectorize the code. Maybe it's just the way the dataset is organized but I have been struggling to find a vectorized solution.

Does anyone have an efficient idea of how to run this without iterating over countries like I am doing? I feel like there should be a way to do this. I added a sample of the dataset below my code sample.

new_columns = ['CountryName', 'Forest Area Change' ]
dff = pd.DataFrame(columns=new_columns) 

for country in countries:
    forest_area_1990 = df[(df.CountryName == country) & (df.IndicatorCode == 'AG.LND.FRST.ZS') & (df.Year == 1990)].Value.values
    forest_area_2015 = df[(df.CountryName == country) & (df.IndicatorCode == 'AG.LND.FRST.ZS') & (df.Year == 2015)].Value.values

    if forest_area_1990.size > 0 and forest_area_2015.size > 0:
        dff = dff.append({new_columns[0]:country, new_columns[1]: forest_area_2015[0] - forest_area_1990[0]}, ignore_index=True)

The dataset looks like the following:

            CountryName CountryCode                 IndicatorName   IndicatorCode  Year      Value  

11531340         Canada         CAN  Forest area (% of land area)  AG.LND.FRST.ZS  1990  38.299073  
21041940         Canada         CAN  Forest area (% of land area)  AG.LND.FRST.ZS  2015  38.166671  
11777751  United States         USA  Forest area (% of land area)  AG.LND.FRST.ZS  1990  33.022308  
21288351  United States         USA  Forest area (% of land area)  AG.LND.FRST.ZS  2015  33.899723

min-year and max-year: do you mean the first and last years, or the maximum difference between any two years? — Roy2012, Jun 12 '20 at 18:45
You could [`pivot_table`](https://stackoverflow.com/questions/47152691/how-to-pivot-a-dataframe) like `df.pivot_table(index=['CountryName', 'CountryCode', 'IndicatorName', 'IndicatorCode'], columns='Year', values='Value')` then it should be as simple as subtracting the two columns. (could mask any countries, indicators you don't care about) — ALollz, Jun 12 '20 at 18:53

score 1 · Accepted Answer · answered Jun 12 '20 at 19:08

1

I believe what you're looking for is the following:

res = df[["CountryName", "Value"]].groupby("CountryName").agg(["first", "last"])
res.columns = ["first", "last"]
res["change"] = res["last"] - res["first"]

The result is:

                   first       last    change
CountryName                                  
Canada         38.299073  38.166671 -0.132402
United States  33.022308  33.899723  0.877415

answered Jun 12 '20 at 19:08

Roy2012

11,755
2
22
35

Thanks, that's exactly what I was looking for. Ideas on how to create a custom agg so that last and first are pesific years 1990 and 2015 since some of the countries have data from 1985 and some 2018 but I just want the 25-year change. I can see the .agg can use lambda but not sure how to access "Years" and .apply() can work to creates a custom function that can't go over columns to years but not sure what the recommend use might be. – Justin Bates Jun 12 '20 at 20:47
Otherwise, I might not .melt right away into this data frame format and do what @ALollz recommended. I just know that other graphs are easy with the .melted format for other things I am doing. – Justin Bates Jun 12 '20 at 20:51

Pandas indexing and calculating over rows efficiency ideas vectorizing code

1 Answers1