Comparing two DataFrames for partial row equality and output equal rows into a new DataFrame?

Question

I have two existing DataFrames which I have named death and air:

County,Death Rate
Autauga,859
Baldwin,976

County,AQI
Baldwin,51
Clay,45

These datasets were taken from different sources and are of different lengths, the same counties do not appear in each DataFrame.

When values for County match I need to create a third DataFrame with only columns AQI and Death Rate.

Here is what I started with (death is the larger df):

import pandas as pd

death = pd.read_csv('SimpleDeath1.csv')
air = pd.read_csv('simpleAir.csv')

data = pd.DataFrame(columns= ['AQI', 'Death Rate'], index=None)

for i in range (0, death.size):
    if death['County'] == air['County']:
        data.append({'AQI' : air['AQI'], 'Death Rate' : death['Death Rate']})

This outputs the following error:

ValueError: Can only compare identically-labeled Series objects

Which has been extensively asked about and discussed on SO. There are actually 382 questions returned when searching SO for this specific error, and while I haven't yet read them all, I have read enough to doubt the efficiency of my initial approach to this problem.
Some highlights from what I've read thus far.

Can anyone with fresh eyes help me to understand a better way of approaching this problem?

Some things I have tried:

Changing comparison:

if death['County'].equals(air['County')]:

Doesn't throw an error, but my new DataFrame is empty

Converting DataFrame values to strings:

if death['County'].str() == air['County'].str()): data.append({'AQI' : air['AQI'], 'Death Rate' : death['Death Rate']})

Throws:

TypeError: 'StringMethods' object is not callable

Any help using DataFrames or another strategy would be greatly appreciated!

`pd.merge(df1,df2,how='inner')` OR `df1.merge(df2,how='inner')` For more info, just do a CTRL + F for `inner` in this link: [Pandas Merging 101](https://stackoverflow.com/questions/53645882/pandas-merging-101) — David Erickson, Aug 21 '20 at 00:39
Thank you for sharing such a helpful link and not just marking my question down, I wish I had come across this sooner! — Elroy Berdahl, Aug 21 '20 at 00:54
yes, over time you will learn when is best to use `merge` vs. `append` vs. `concat`, which are the most common ways to join data together in pandas. — David Erickson, Aug 21 '20 at 00:56

cy9 · Accepted Answer · 2020-08-21T00:52:18.177

In this case, you can use the merge function from pandas:

import pandas as pd

death = {'County': ['Autauga', 'Baldwin'], 'Death Rate': [859, 976]}
air = {'County': ['Baldwin', 'Clay'], 'AQI': [51, 45]}

death = pd.DataFrame(death)
air = pd.DataFrame(air)

merged = death.merge(air, how='inner', on='County')
merged

This yields:

County  Death Rate  AQI
0   Baldwin 976 51

on='County' defines the column that contains the keys used in the merging operation.

how='inner' tells the code to select only keys that are present in both data frames based on the column 'County' as defined above.

You can then use the drop function from pandas to remove the unwanted 'County' column afterwards.

footfalcon · Answer 2 · 2020-08-21T01:20:47.080

0

import pandas as pd

death = pd.DataFrame({'County': ['county1', 'county2', 'county3'], 'Deaths': [1,2,3]})
County  Deaths
0   county1 1
1   county2 2
2   county3 3

air = pd.DataFrame({'County': ['county2', 'county3', 'county4'], 'AQI': [21,32,43]})
County  AQI
0   county2 21
1   county3 32
2   county4 43

merged = pd.merge(death, air)
County  Deaths  AQI
0   county2 2   21
1   county3 3   32

edited Aug 21 '20 at 01:20

answered Aug 21 '20 at 00:57

footfalcon

581
5
16

Comparing two DataFrames for partial row equality and output equal rows into a new DataFrame?

2 Answers2