1

I have two dataframes,

df1                       df2
country                   country
US                        AR
US                        AD
CA                        AO
CN                        AU
AR                        US

How do I group by them by combining the country list to a set the compare the difference between two dataframes?

My expected output will be like,

country code   df1_country_count   df2_country_count
AR                   1                    1
AD                   0                    1
AO                   0                    1
AU                   0                    1
US                   2                    1 
CA                   1                    0
CN                   1                    0
Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158
Jiayu Zhang
  • 719
  • 7
  • 22
  • 2
    1) combine all the dataframes and assign a column identify the source: `df = pd.concat([d.assign(source=f'df{i}') for i, d in enumerate([df1, df2], 1)], ignore_index=True)` 2) use `pd.crosstab`: `counts = pd.crosstab(df.country, df.source)`. This option is simple, and works regardless of the number of sources. – Trenton McKinney Dec 13 '22 at 23:43

3 Answers3

2
(df1.value_counts().to_frame('df1_country_count')
 .join(df2.value_counts().to_frame('df2_country_count'), how='outer')
 .fillna(0).astype('int').rename_axis('country code'))

result:

               df1_country_count    df2_country_count
country code        
AD             0                    1
AO             0                    1
AR             1                    1
AU             0                    1
CA             1                    0
CN             1                    0
US             2                    1
Panda Kim
  • 6,246
  • 2
  • 12
2

You can use value_counts and then concat.

out = pd.concat([df1.country.value_counts(), 
           df2.country.value_counts()], axis=1).fillna(0).astype(int)
out.columns = ['df1_country', 'df2_country']
print(out)

    df1_country  df2_country
US            2            1
CA            1            0
CN            1            0
AR            1            1
AD            0            1
AO            0            1
AU            0            1
SomeDude
  • 13,876
  • 5
  • 21
  • 44
0
  1. Combine all of the dataframes, regardless of how many, with pd.concat and use .assign in a list comprehension to add a 'source' column.
    • source=f'df{i}': structure the f-string for how the column name should appear in the frequency table.
    • If loading the data from files, see Option 4 of this answer to load the csv files directly into a single dataframe with a new column.
  2. Use pd.crosstab to compute a frequency table of the two columns.
import pandas as pd

# sample dataframes
df1 = pd.DataFrame({'country': ['US', 'US', 'CA', 'CN', 'AR']})
df2 = pd.DataFrame({'country': ['AR', 'AD', 'AO', 'AU', 'US']})

# list of dataframes
df_list = [df1, df2]

# combine dataframes
df = pd.concat([d.assign(source=f'df{i}') for i, d in enumerate(df_list, 1)], ignore_index=True)

# create frequency table
counts = pd.crosstab(df.country, df.source)

source   df1  df2
country          
AD         0    1
AO         0    1
AR         1    1
AU         0    1
CA         1    0
CN         1    0
US         2    1
Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158