I'm having trouble merging two dataframes in pandas. They are parts of a dataset split between two files, and they share some columns and values, namely 'name' and 'address'. The entries with identical values do not share their index with entries in the other file. I tried variations of the following line:
res = pd.merge(df, df_p, on=['name', 'address'], how="left")
When the how argument was set to 'left', the columns from df_p had no values. 'right' had the opposite effect, with columns from df being empty. 'inner' resulted in an empty dataframe and 'outer' duplicated the number of entries, essentially just appending the results of 'left' and 'right'.
I manually verified that there are identical combinations of 'name' and 'address' values in both files.
Edit: Attempt at merging on a single of those columns appears to be successful, however I want to avoid merging incorrect entries in case 2 people with identical names have different addresses and vice versa
Edit1: Here's some more information on the data-set.
df.info() output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3983 entries, 0 to 3982
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 3983 non-null int64
1 name 3983 non-null object
2 address 3983 non-null object
3 race 3970 non-null object
4 marital-status 3967 non-null object
5 occupation 3971 non-null object
6 pregnant 3969 non-null object
7 education-num 3965 non-null float64
8 relationship 3968 non-null object
9 skewness_glucose 3972 non-null float64
10 mean_glucose 3572 non-null float64
11 capital-gain 3972 non-null float64
12 kurtosis_glucose 3970 non-null float64
13 education 3968 non-null object
14 fnlwgt 3968 non-null float64
15 class 3969 non-null float64
16 std_glucose 3965 non-null float64
17 income 3974 non-null object
18 medical_info 3968 non-null object
19 native-country 3711 non-null object
20 hours-per-week 3971 non-null float64
21 capital-loss 3969 non-null float64
22 workclass 3968 non-null object
dtypes: float64(10), int64(1), object(12)
memory usage: 715.8+ KB
example entry from df:
0,Curtis Brown,"32266 Byrd Island
Fowlertown, DC 84201", White, Married-civ-spouse, Exec-managerial,f,9.0, Husband,1.904881822,79.484375,15024.0,0.667177618, HS-grad,147707.0,0.0,39.49544760000001, >50K,"{'mean_oxygen':'1.501672241','std_oxygen':'13.33605383','kurtosis_oxygen':'11.36579476','skewness_oxygen':'156.77910559999995'}", United-States,60.0,0.0, Private
df_p.info() output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3933 entries, 0 to 3932
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 3933 non-null int64
1 name 3933 non-null object
2 address 3933 non-null object
3 age 3933 non-null int64
4 sex 3933 non-null object
5 date_of_birth 3933 non-null object
dtypes: int64(2), object(4)
memory usage: 184.5+ KB
sample entry from df_p:
2273,Curtis Brown,"32266 Byrd Island
Fowlertown, DC 84201",44, Male,1975-03-26
As you can see, the chosen samples are for the same person, but their index does not match, which is why I tried using the name and address columns.
Edit2: Changing the order of df and df_p in the merge seems to have solved the issue, though I have no clue why.