Reshaping World Bank data in python

Question

I have the following data (below)

{'Country Name': {0: 'China', 1: 'China', 2: 'China'}, 'Country Code': {0: 'CHN', 1: 'CHN', 2: 'CHN'}, 'Series Name': {0: 'Consumer price index (2010 = 100)', 1: 'Age dependency ratio (% of working-age population)', 2: 'Age dependency ratio, old (% of working-age population)'}, '1972 [YR1972]': {0: '..', 1: '79.4811770762984', 2: '6.7804365054766'}, '1973 [YR1973]': {0: '..', 1: '78.8312385191076', 2: '6.83482919991518'}}

My desired output, having dropped the stub names on the years and reshaped to long-format

desired_output = pd.DataFrame({'Country Code': ['CHN', 'CHN'], 'Country Name' : ['China', 'China'],
                           'Year': ['1972', '1973'],
                           'Age dependency ratio' : ['79.4811770762984', '78.8312385191076' ],
                          'Age dependency ratio, old' : ['6.7804365054766', '6.83482919991518']})

I tried,

df.pivot(index = ['Country Name', 'Country Code'], columns = 'Series Name', values = ['1972 [YR1972]', '1973 [YR1973]'])

And it worked but still wasn't in the format I wanted.

As for removing the suffix, really can't figure it out, as it varies by year.

Refrain from showing your dataframe as an image. Your question needs a minimal reproducible example consisting of sample input, expected output, actual output, and only the relevant code necessary to reproduce the problem. See [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) for best practices related to Pandas questions. — itprorh66, Feb 11 '23 at 18:55
Can you supply us with the dataframe before and after? (after being the resulting dataframe you wish to get) — Lidor Eliyahu Shelef, Feb 11 '23 at 19:30

Anoushiravan R · Answer 1 · 2023-02-11T21:25:58.627

You can try the following solution if your original dateframe is similar to one you shared. We need a little bit of data cleaning before we could actually pivot and get your desired output:

# First we extract future column names from your Series Name column and 
# assign it to a new column name, then we subset the resulting DataFrame
# of rows in where NaN values appear

df3 = (df.assign(Series_Name = lambda c: c['Series Name'].str.extract('(Age.*)(?=\s+\()'))
 .drop(columns='Series Name', axis=1).loc[lambda c: ~ c['Series_Name'].isnull()])


(df3.melt(id_vars=['Country Name', 'Country Code', 'Series_Name'],
         value_vars=['1972 [YR1972]', '1973 [YR1973]'], var_name='Year')
 .assign(Year = lambda c: c['Year'].str.extract('(\d+)(?=\s+)'))
 .pivot(index = ['Country Name', 'Country Code', 'Year'], columns='Series_Name', values='value'))




Series_Name                    Age dependency ratio Age dependency ratio, old
Country Name Country Code Year                                               
China        CHN          1972     79.4811770762984           6.7804365054766
                          1973     78.8312385191076          6.83482919991518

Reshaping World Bank data in python

1 Answers1