0

I have a big set of data with 100+ columns of data structured like:

country_a country_b year variable1 variable2 ...... varaible100

The goal is to have the 100 variables separated into 100 new dataframes, pivoted, and save them into csvs.

Below is the code for transforming one variable:

import pandas as pd

df = pd.DataFrame({
    'country_a': ['aa', 'bb', 'cc'],
    'country_b': ['xx', 'yy', 'zz'],
    'year': [2018, 2019, 2020],
    'var_a': [1, 0, 1],
    'var_b': [2, 1, 2],
    'var_c': [0, 1.6, 2.4]
})

print(df)

  country_a country_b  year  var_a  var_b  var_c
0        aa        xx  2018      1      2    0.0
1        bb        yy  2019      0      1    1.6
2        cc        zz  2020      1      2    2.4

Then I'd do a pivot:

table=pd.pivot_table(df, values='var_a', index=['country_a','country_b'],columns=['year']).reset_index()
table.to_csv('var_a.csv')

table will look like this:

    country_a   country_b   2018    2019    2020
0   aa  xx  1.0 NaN NaN
1   bb  yy  NaN 0.0 NaN
2   cc  zz  NaN NaN 1.0

I asked the first part of the question here: pandas: melt 100+ variables into 100+ new dataframes Now I have problems incorporating the pivot function for the separated dfs...

Thank you a lot!

anky
  • 74,114
  • 11
  • 41
  • 70
js3933
  • 15
  • 2

3 Answers3

0

Instead of Pivot , use pd.melt , it is effective in your case

Vivs
  • 447
  • 4
  • 11
0

We could use DataFrame.pivot_table then we could join with country column using DataFrame.filter.

new_df = (df.filter(regex='country')
            .join(df.pivot_table(index=df.index, columns='year', values='var_a'))
         )
print(new_df)

Output

  country_a country_b  2018  2019  2020
0        aa        xx   1.0   NaN   NaN
1        bb        yy   NaN   0.0   NaN
2        cc        zz   NaN   NaN   1.0

If you can't filter countries using DataFrame.filter then you can select the columns using:

list_columns_names = ['spain', 'england',..]
df[list_columns_names].join(df.pivot_table(...))

if the columns of the countries are together in the dataframe it may be easier to use iloc

num_countries = 10
df.iloc[:,:num_countries].join(df.pivot_table(...))

Another options is set_index + unstack:

new_df = (df.filter(regex='country')
            .join(df.set_index('year', append=True)['var_a'].unstack('year'))
         )
ansev
  • 30,322
  • 5
  • 17
  • 31
0

Here is a way to re-shape the original data frame (using melt, unstack and reset_index), followed by exporting each of var_a, var_b, ..., to its own CSV file:

df_new = (
    df.melt(id_vars=['country_a', 'country_b', 'year'], 
            var_name='variable', 
            value_name='value')
    .set_index(['country_a', 'country_b', 'year', 'variable'])
    .sort_index()
    .squeeze()
    .unstack(level='year')
    .fillna(0)              # for display purposes
    .astype(int)            # also for display purposes
    .reset_index(level=['country_a', 'country_b'])
)
print(df_new)

year     country_a country_b  2018  2019  2020
variable                                      
var_a           aa        xx     1     0     0
var_b           aa        xx     2     0     0
var_c           aa        xx     0     0     0
var_a           bb        yy     0     0     0
var_b           bb        yy     0     1     0
var_c           bb        yy     0     1     0
var_a           cc        zz     0     0     1
var_b           cc        zz     0     0     2
var_c           cc        zz     0     0     2

Now export each variable to its own CSV file:

for idx in df_new.index.unique():
    filename = f'{idx}.csv'
    with open(filename, 'wt') as handle:
        #df_new.loc[idx].to_csv(handle)      # <- un-comment this line in your code
        print(filename)
        print(df_new.loc[idx])
        print()

var_a.csv
year     country_a country_b  2018  2019  2020
variable                                      
var_a           aa        xx     1     0     0
var_a           bb        yy     0     0     0
var_a           cc        zz     0     0     1

var_b.csv
year     country_a country_b  2018  2019  2020
variable                                      
var_b           aa        xx     2     0     0
var_b           bb        yy     0     1     0
var_b           cc        zz     0     0     2

var_c.csv
year     country_a country_b  2018  2019  2020
variable                                      
var_c           aa        xx     0     0     0
var_c           bb        yy     0     1     0
var_c           cc        zz     0     0     2
jsmart
  • 2,921
  • 1
  • 6
  • 13
  • would you mind taking a look at this question? you're absolutely awesome with pandas.. https://stackoverflow.com/q/64827979/12809745 – js3933 Nov 15 '20 at 02:51