
I have a superset dataframe and a subset dataframe. The superset has n columns and the subset has m columns (n > m).

The requirement is to compare the m columns of the subset with the columns of the same name in the superset.

Note: both dataframes are loaded from CSV files. The subset is the reference file and the superset is the complete tool output file. The columns of both dataframes vary with the requirement under test.

Example 1:

Superset columns:

Car_Brand, Car_model, Color, Year, Engine

Subset columns:

Year, Engine

I have to log a failure if the 'Year' and 'Engine' entries do not match between the two dataframes.

Example 2:

Superset columns:

Car_Brand, Car_model, Color, Year, Engine, Country, Rating, Price

Subset columns:

Car_model, Rating, Price

I have to log a failure if the Car_model, Rating and Price entries do not match between the two dataframes.

There are hundreds of such cases, so I need a generic way to merge the superset and the subset based on the subset's column names.

How can I achieve this?

Something like:

common_df = superset_df.merge(subset_df, on=subset_df.columns[0], how='inner')
Arti
  • This is what I finally used, simple and precise: common_df = superset_df[list(subset_df.columns.values)] – Arti Sep 18 '22 at 13:35

1 Answer


This should work even when the number of columns in the subset dataframe varies, provided every subset column is also present in the superset dataframe.

import numpy as np
import pandas as pd

superset = {'Car_Brand':['mahi01','ta02','suz03','hon04','hyu05','ki06'],
            'Car_model':['xu01','nex02','bal03','ama04','cre05','son06'],
            'Color':['white','blue','red','grey','black','beige'],
            'Year':['2018','2019','2020','2021','2022','2017'],
            'Engine':['ab01','ab02','ab03','ab04','ab05','ab06']}

# Superset: the full tool output
dfSuper = pd.DataFrame(superset, columns=['Car_Brand', 'Car_model', 'Color',
                                 'Year', 'Engine'])
subset = {'Year':['2018','2010','2020','2021','2020'],
          'Engine':['ab01','ab02','ab03','ab04','ab05']}
# Subset: the reference data
dfSub = pd.DataFrame(subset, columns=['Year', 'Engine'])
# Left-merge on every subset column; indicator='match' adds a column
# holding 'both' or 'left_only' for each superset row
df3Result = dfSuper.merge(dfSub, on=list(dfSub.columns), how='left', indicator='match')
# Convert the indicator to a boolean: True where the row was also found in the subset
df3Result['match'] = np.where(df3Result['match'] == 'both', True, False)

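To go from the merge indicator to logged failures, something along the following lines might work. This is only a sketch under assumptions not stated in the question: it treats a failure as a subset (reference) row that has no identical row in the superset's matching columns, it ignores row order, and it assumes the shared columns have the same dtype in both dataframes. The function name `find_unmatched` and the logger name are made up for illustration.

```python
import logging

import pandas as pd

logger = logging.getLogger("compare")  # hypothetical logger name


def find_unmatched(superset_df: pd.DataFrame, subset_df: pd.DataFrame) -> pd.DataFrame:
    """Return the subset rows that have no identical row in the superset."""
    cols = list(subset_df.columns)
    missing = [c for c in cols if c not in superset_df.columns]
    if missing:
        raise KeyError(f"columns missing from superset: {missing}")
    # De-duplicate the superset side so the left merge cannot multiply subset rows.
    candidates = superset_df[cols].drop_duplicates()
    merged = subset_df.merge(candidates, on=cols, how="left", indicator=True)
    # 'left_only' marks subset rows that were not found in the superset.
    unmatched = merged.loc[merged["_merge"] == "left_only", cols]
    for row in unmatched.to_dict("records"):
        logger.error("No match in superset for %s", row)
    return unmatched


# Same sample data as in the answer above.
dfSuper = pd.DataFrame({
    "Car_Brand": ["mahi01", "ta02", "suz03", "hon04", "hyu05", "ki06"],
    "Car_model": ["xu01", "nex02", "bal03", "ama04", "cre05", "son06"],
    "Color": ["white", "blue", "red", "grey", "black", "beige"],
    "Year": ["2018", "2019", "2020", "2021", "2022", "2017"],
    "Engine": ["ab01", "ab02", "ab03", "ab04", "ab05", "ab06"],
})
dfSub = pd.DataFrame({
    "Year": ["2018", "2010", "2020", "2021", "2020"],
    "Engine": ["ab01", "ab02", "ab03", "ab04", "ab05"],
})

unmatched = find_unmatched(dfSuper, dfSub)
print(unmatched)
```

With the sample data, the two reference rows ('2010', 'ab02') and ('2020', 'ab05') are reported, because neither combination occurs in the superset. Since the join keys are taken from `subset_df.columns`, the same function should cover other column combinations without changes.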

  • Thank you, Arpan. I got some ideas; my requirement is a bit more complex, but this will definitely help. – Arti Sep 09 '22 at 21:42
  • This is what I finally used, simple and precise: common_df = superset_df[list(subset_df.columns.values)] – Arti Sep 18 '22 at 13:38
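The column selection from the comment above only extracts the matching columns; the comparison itself still has to be done afterwards. One way it could be done is sketched below, assuming (this is not stated in the thread) that row i of the reference file corresponds to row i of the tool output. The helper name `rows_that_differ` is hypothetical, and rows beyond the length of the shorter dataframe are simply ignored in this sketch.

```python
import pandas as pd


def rows_that_differ(superset_df: pd.DataFrame, subset_df: pd.DataFrame) -> pd.DataFrame:
    """Return the subset rows whose values differ from the superset row at the same position."""
    cols = list(subset_df.columns)
    common_df = superset_df[cols]  # column selection as in the comment
    # Compare only the overlapping rows, aligned by position rather than by index label.
    n = min(len(common_df), len(subset_df))
    left = common_df.iloc[:n].reset_index(drop=True)
    right = subset_df.iloc[:n].reset_index(drop=True)
    # Element-wise comparison; note that NaN != NaN counts as a difference here.
    differs = (left != right).any(axis=1)
    return right[differs]


superset_df = pd.DataFrame({
    "Car_model": ["xu01", "nex02", "bal03", "ama04", "cre05", "son06"],
    "Year": ["2018", "2019", "2020", "2021", "2022", "2017"],
    "Engine": ["ab01", "ab02", "ab03", "ab04", "ab05", "ab06"],
})
subset_df = pd.DataFrame({
    "Year": ["2018", "2010", "2020", "2021", "2020"],
    "Engine": ["ab01", "ab02", "ab03", "ab04", "ab05"],
})

mismatched = rows_that_differ(superset_df, subset_df)
print(mismatched)
```

If row order is not guaranteed to be the same in both files, a merge on the subset columns is probably the safer choice than a positional comparison.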