0

I'm having trouble merging two dataframes in pandas. They are parts of a dataset split between two files, and they share some columns and values, namely 'name' and 'address'. The entries with identical values do not share their index with entries in the other file. I tried variations of the following line:

res = pd.merge(df, df_p, on=['name', 'address'], how="left")

When the how argument was set to 'left', the columns from df_p had no values. 'right' had the opposite effect, with columns from df being empty. 'inner' resulted in an empty dataframe and 'outer' duplicated the number of entries, essentially just appending the results of 'left' and 'right'.

I manually verified that there are identical combinations of 'name' and 'address' values in both files.

Edit: Attempt at merging on a single of those columns appears to be successful, however I want to avoid merging incorrect entries in case 2 people with identical names have different addresses and vice versa

Edit1: Here's some more information on the data-set.

df.info() output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3983 entries, 0 to 3982
Data columns (total 23 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Unnamed: 0        3983 non-null   int64  
 1   name              3983 non-null   object 
 2   address           3983 non-null   object 
 3   race              3970 non-null   object 
 4   marital-status    3967 non-null   object 
 5   occupation        3971 non-null   object 
 6   pregnant          3969 non-null   object 
 7   education-num     3965 non-null   float64
 8   relationship      3968 non-null   object 
 9   skewness_glucose  3972 non-null   float64
 10  mean_glucose      3572 non-null   float64
 11  capital-gain      3972 non-null   float64
 12  kurtosis_glucose  3970 non-null   float64
 13  education         3968 non-null   object 
 14  fnlwgt            3968 non-null   float64
 15  class             3969 non-null   float64
 16  std_glucose       3965 non-null   float64
 17  income            3974 non-null   object 
 18  medical_info      3968 non-null   object 
 19  native-country    3711 non-null   object 
 20  hours-per-week    3971 non-null   float64
 21  capital-loss      3969 non-null   float64
 22  workclass         3968 non-null   object 
dtypes: float64(10), int64(1), object(12)
memory usage: 715.8+ KB

example entry from df:

0,Curtis Brown,"32266 Byrd Island
Fowlertown, DC 84201", White, Married-civ-spouse, Exec-managerial,f,9.0, Husband,1.904881822,79.484375,15024.0,0.667177618, HS-grad,147707.0,0.0,39.49544760000001, >50K,"{'mean_oxygen':'1.501672241','std_oxygen':'13.33605383','kurtosis_oxygen':'11.36579476','skewness_oxygen':'156.77910559999995'}", United-States,60.0,0.0, Private

df_p.info() output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3933 entries, 0 to 3932
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Unnamed: 0     3933 non-null   int64 
 1   name           3933 non-null   object
 2   address        3933 non-null   object
 3   age            3933 non-null   int64 
 4   sex            3933 non-null   object
 5   date_of_birth  3933 non-null   object
dtypes: int64(2), object(4)
memory usage: 184.5+ KB

sample entry from df_p:

2273,Curtis Brown,"32266 Byrd Island
Fowlertown, DC 84201",44, Male,1975-03-26

As you can see, the chosen samples are for the same person, but their index does not match, which is why I tried using the name and address columns.

Edit2: Changing the order of df and df_p in the merge seems to have solved the issue, though I have no clue why.

0 Answers0