Let's say I have a simple Pandas DataFrame where one column contains a country name and another column contains some value. For example:
# Import Python Libraries
import numpy as np
import pandas as pd
# Create Sample DataFrame
df = pd.DataFrame(data={'Country': ['United States','United States','United States','United States', \
'United States','United States','United States','United States', \
'United States','United States','United States','United States', \
'Canada','Canada','Canada','Canada','Canada','Canada','Mexico', \
'Mexico','Mexico','Mexico','England','England','England','England', \
'England','England','England','England','England','England','England', \
'England','England','England','France','France','France','Spain','Germany', \
'Germany','Germany','Germany','Germany','Germany','Germany','Germany', \
'Germany','Germany'], 'Value': np.random.randint(1000, size=50)})
Which generates:
print(df.head())
Index Country Value
0 United States 943
1 United States 567
2 United States 534
3 United States 700
4 United States 470
My question is, what is the easiest way in Python to convert this DataFrame into one where each country has its own column and all the values of that country are listed in that column? In other words, how can I easily create a DataFrame where the number of columns is the unique count of countries in the 'Country' column, and that each column's length will vary depending on the number of times the corresponding country appears in the original DataFrame?
Here is sample code that provides a solution:
# Store Unique Country Names in Variable
columns = df['Country'].unique()
# Create Individual Country DataFrames
df_0 = df[df['Country'] == columns[0]]['Value'].values.tolist()
df_1 = df[df['Country'] == columns[1]]['Value'].values.tolist()
df_2 = df[df['Country'] == columns[2]]['Value'].values.tolist()
df_3 = df[df['Country'] == columns[3]]['Value'].values.tolist()
df_4 = df[df['Country'] == columns[4]]['Value'].values.tolist()
df_5 = df[df['Country'] == columns[5]]['Value'].values.tolist()
df_6 = df[df['Country'] == columns[6]]['Value'].values.tolist()
# Create Desired Output DataFrame
data_dict = {columns[0]: df_0, columns[1]: df_1, columns[2]: df_2, columns[3]: df_3, columns[4]: df_4, columns[5]: df_5, columns[6]: df_6}
new_df = pd.DataFrame({k:pd.Series(v[:len(df)]) for k,v in data_dict.items()})
Which generates:
print(new_df)
United States Canada Mexico England France Spain Germany
0 838.0 135.0 496.0 568.0 71.0 588.0 811.0
1 57.0 118.0 268.0 716.0 422.0 NaN 107.0
2 953.0 396.0 850.0 860.0 707.0 NaN 318.0
3 251.0 294.0 815.0 888.0 NaN NaN 633.0
4 127.0 466.0 NaN 869.0 NaN NaN 910.0
5 892.0 824.0 NaN 776.0 NaN NaN 472.0
6 11.0 NaN NaN 508.0 NaN NaN 466.0
7 563.0 NaN NaN 299.0 NaN NaN 200.0
8 864.0 NaN NaN 568.0 NaN NaN 637.0
9 810.0 NaN NaN 78.0 NaN NaN 392.0
10 268.0 NaN NaN 106.0 NaN NaN NaN
11 389.0 NaN NaN 153.0 NaN NaN NaN
12 NaN NaN NaN 217.0 NaN NaN NaN
13 NaN NaN NaN 941.0 NaN NaN NaN
While the above code works, it's obviously not a tenable solution for larger data sets. What is the most efficient way of generating this result from the original DataFrame?
Thank you!