Dataframe merge based on all the columns of subset dataframe

Question

I have a superset dataframe and a subset dataframe. The superset has n columns and the subset has m (n > m).

The requirement is to compare the m columns of the subset with the columns of the same name in the superset.

Note: the dataframes contain data from CSV files; the subset is the reference file and the superset is the full tool output file. Both dataframes vary depending on the requirement under test.

e.g.1.

Superset columns:

Car_Brand, Car_model, Color, Year, Engine

Subset columns:

Year, Engine

I have to log a failure if the entries of 'Year' and 'Engine' do not match between the two dataframes.

e.g.2:

Superset columns:

Car_Brand, Car_model, Color, Year, Engine, Country, Rating, Price

Subset columns:

Car_model, Rating, Price

I have to log a failure if the entries of Car_model, Rating and Price do not match between the two dataframes.

There are hundreds of such cases, so I need a generic way to merge the superset and the subset based on the column names of the subset.

How can I achieve this?

Something like:

common_df = superset_df.merge(subset_df, on=subset_df.columns[0], how='inner')

This is what I used finally, simple and precise, common_df = superset_df[list(subset_df.columns.values)] — Arti, Sep 18 '22 at 13:35

score 1 · Answer 1 · answered Sep 09 '22 at 17:00

This should work even when the number of columns in the subset dataframe varies, as long as the same columns are present in the superset dataframe.

import pandas as pd

# Superset: the full tool output.
superset = {'Car_Brand': ['mahi01', 'ta02', 'suz03', 'hon04', 'hyu05', 'ki06'],
            'Car_model': ['xu01', 'nex02', 'bal03', 'ama04', 'cre05', 'son06'],
            'Color': ['white', 'blue', 'red', 'grey', 'black', 'beige'],
            'Year': ['2018', '2019', '2020', '2021', '2022', '2017'],
            'Engine': ['ab01', 'ab02', 'ab03', 'ab04', 'ab05', 'ab06']}
dfSuper = pd.DataFrame(superset)

# Subset: the reference data, with only the columns to compare.
subset = {'Year': ['2018', '2010', '2020', '2021', '2020'],
          'Engine': ['ab01', 'ab02', 'ab03', 'ab04', 'ab05']}
dfSub = pd.DataFrame(subset)

# Merge on every column of the subset; the indicator column records
# whether each superset row was also found in the subset.
df3Result = dfSuper.merge(dfSub, on=list(dfSub.columns), how='left', indicator='match')

# Turn the indicator into a boolean: True only when the row exists in both frames.
df3Result['match'] = df3Result['match'] == 'both'
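
To get from the merge to the "log a failure" step in the question, the same idea can be wrapped in a function and run from the subset's side, so that every reference row with no counterpart in the superset is reported. This is only a sketch: `find_mismatches` is a hypothetical helper name, the sample data is made up, and it assumes a reference row counts as passing if the same combination of values appears anywhere in the superset, regardless of row position.

```python
import pandas as pd

def find_mismatches(superset_df, subset_df):
    """Return the subset rows that have no matching row in the superset.

    Hypothetical helper: the join keys are simply all columns of the subset.
    """
    keys = list(subset_df.columns)
    # Deduplicate the superset's key columns so a match never multiplies rows.
    super_keys = superset_df[keys].drop_duplicates()
    merged = subset_df.merge(super_keys, on=keys, how='left', indicator=True)
    # 'left_only' marks subset rows that were not found in the superset.
    return merged.loc[merged['_merge'] == 'left_only', keys].reset_index(drop=True)

# Made-up sample data for illustration.
superset_df = pd.DataFrame({
    'Car_model': ['xu01', 'nex02', 'bal03'],
    'Year': ['2018', '2019', '2020'],
    'Engine': ['ab01', 'ab02', 'ab03'],
})
subset_df = pd.DataFrame({
    'Year': ['2018', '2010', '2020'],
    'Engine': ['ab01', 'ab02', 'ab03'],
})

mismatches = find_mismatches(superset_df, subset_df)
for row in mismatches.to_dict('records'):
    print(f"FAIL: no superset row matches {row}")
```

Because the keys come from `subset_df.columns`, the same function should work unchanged for the second example in the question (Car_model, Rating, Price), provided the column names and dtypes agree between the two files.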

Thank you Arpan. I got some ideas; my requirement is a bit more complex, but this will definitely help. — Arti, Sep 09 '22 at 21:42
This is what I used finally, simple and precise, common_df = superset_df[list(subset_df.columns.values)] — Arti, Sep 18 '22 at 13:38
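
A minimal sketch of how that column selection could feed a comparison. It assumes (this is not stated in the thread) that both files have the same number of rows in the same order, so rows can be compared position by position; the sample data and the names `diff_mask` and `bad_rows` are made up for illustration.

```python
import pandas as pd

# Made-up sample data for illustration.
superset_df = pd.DataFrame({
    'Car_model': ['xu01', 'nex02', 'bal03'],
    'Rating': [4, 5, 3],
    'Price': [100, 200, 300],
    'Color': ['white', 'blue', 'red'],
})
subset_df = pd.DataFrame({
    'Car_model': ['xu01', 'nex02', 'bal03'],
    'Rating': [4, 2, 3],
    'Price': [100, 200, 300],
})

# Keep only the superset columns that the subset defines, in the subset's order.
cols = list(subset_df.columns)
common_df = superset_df[cols]

# Element-wise "not equal": True wherever a cell differs from the reference.
diff_mask = common_df.ne(subset_df)
bad_rows = diff_mask.any(axis=1)

for idx in common_df.index[bad_rows]:
    bad_cols = [c for c in cols if diff_mask.at[idx, c]]
    print(f"FAIL: row {idx} differs in {bad_cols}")
```

Note that `ne` treats a missing value as different from everything, including another missing value, so cells that are empty in both CSV files would be reported as failures unless they are filled first.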
