0

I need to perform validation in two worksheets to see if the same value is filled in both the sheets.

for example: in Sheet1, I have a column with Product and has value of WM001. in sheet2, I have a column with same name Product and has value of WM001 then its a Pass and I do nothing.

In other case

for example: in Sheet1, I have a column with Product and has value of WM001. in sheet2, I have a column with same name Product and has value of WM00X then its a Fail and I need to output the cell location and name so that we can go and correct the entry.

I tried using merge outer join but in my case I need to compare the key itself. any suggestions.

Do I need to loop both dataframes?

Sheet1

Product
WM001
WM002

Sheet2 | Product | | -------- | | WM001 | | WM00X |

The output should tell that the product WM00X is not found anywhere in Sheet1 and then putput the location and the value WM00X.

user2653353
  • 83
  • 1
  • 1
  • 9
  • Does this answer your question? [Compare two columns using pandas](https://stackoverflow.com/questions/27474921/compare-two-columns-using-pandas) – rajah9 Mar 01 '21 at 13:08

1 Answers1

0

Simple Example

I have created a sample.xlsx file to try mimic your case below. "Sheet1" has a column product and so does "Sheet2".

The only difference between the two columns is record 2 and 5 (starting on a zero based index system) which you want to identify (in terms of there index).

Sheet1

Product|

WM001 | WM00X | WM001 | WM001 | WM00X | WM001 |

Sheet2

Product|

WM001 | WM00X | WM00X | WM001 | WM00X | WM00X |

Pandas allows you to compare columns directly and create a mask (of truth values) which we can use to filter the rows you want.

>>> import pandas as pd
>>> sht1 = pd.read_excel('sample.xlsx', sheet_name='Sheet1')
>>> sht2 = pd.read_excel('sample.xlsx', sheet_name='Sheet2')

# Get the index's of those records that don't match in each sheet
>>> mask = sht1[sht1 != sht2].dropna().index
# Identify those discrepencies in both sheets.
>>> sht1.loc[mask,:]
>>> sht2.loc[mask,:]

Note

The last lines will print the index of those records and the values in the Product column. From there you should be able to identify the records.

If you want to do any inplace changes to the sheet with the index value I recommend using .iat[] in pandas https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iat.html.

EDIT:

To deal with records with different lengths, assuming the indexes match up to the point of the sheet with the least number of rows. Obviously if sheet1 has 50 rows and sheet2 has 70 rows, you can only compare the first 50 rows between the both.

    >>> import pandas as pd
    >>> sht1 = pd.read_excel('sample.xlsx', sheet_name='Sheet1')
    >>> sht2 = pd.read_excel('sample.xlsx', sheet_name='Sheet2')
    
    >>> min_length = min(len(sht1), len(sht2))
    >>> sht1_subset = sht1.loc[:min_length,'Product']
    >>> sht2_subset = sht2.loc[:min_length,'Product']
    # Get the Truth values of those records that don't match on product column in each sheet
    >>> mask = (sht1_subset != sht2_subset)
    # Identify those discrepencies in both sheets using `mask`
    >>> sht1.where(mask).dropna()
    >>> sht2.where(mask).dropna()

Print out mask to see what it returns, should be a series of True False values representing the expression of those records in Product field that are not equal.

  • In my case its getting "Can only compare identically-labeled Series objects", my dataframes doesnt have same number of rows. – user2653353 Mar 01 '21 at 22:50