Compare two column values and find distinct unique values causing differences

Question

I want to compare column1 and column2 and get the unique values causing difference to be detected from column1 (ignoring NaN). So in this case, I should get answer to be tb, 2 and rexth.Its comparing first column to second one

Also, can we create and store the result in another column?

df['diff'] = df['column1'].groupby(df['column2']).unique()

Result
index  column1      column2    diff
1.      John        John-'tb'  -tb
2.      Dec-1       Dec-2       2
3.      Teb12       rexth       rexth
4.      dx-tyr      nan         nan
5.     [fever]='1'  nan         nan
6.     [dsc]= rx2   NA          NA or leave it empty("")   
7.      fv=56       fv=rt276    rt276

Code can be in either R or Python. I dont mind

Shouldn't the last string be `rxth` since they both have `e` in the second position? — user3483203, Jul 22 '19 at 17:59
This question might help... https://stackoverflow.com/questions/28834459/extract-characters-that-differ-between-two-strings — Andrew Gustar, Jul 22 '19 at 17:59

score 0 · Answer 1 · answered Jul 22 '19 at 18:23

0

stringr::str_remove() gets you part of the way there:

library(tidyverse)
tibble(
  index   = c(1, 2, 3),
  column1 = c("John", "Dec-1", "Teb12"),
  column2 = c("John-'tb'", "Dec-2", "rexth"),
  diff    = c("-tb", "2", "rexth")
) ->
  data

data %>% 
  mutate(
    diff = str_remove(column2, column1)
  )

answered Jul 22 '19 at 18:23

jpboichuk

1
1

Not getting the difference column, you got same result? – Livingstone Jul 22 '19 at 18:35
This gives same output as column2 instead of splitting only those characters causing difference – Livingstone Jul 22 '19 at 18:39

Amit · Accepted Answer · 2019-07-23T09:14:16.203

0

You can try the below code.

import pandas as pd
from numpy import NaN

def FindDifference(Row):
    x = Row['column1']
    y = Row['column2']
    Difference = ""
    if pd.isnull(y) or y=="nan" or y=="NA":
        return NaN
    if len(x) <= len(y):
        for i in y:
            if i not in x:
                Difference += str(i)
    else:
        for i in x:
            if i not in y:
                Difference += str(i)
    return Difference

ReadData = pd.read_csv("ThePathToYourCSVfile.csv")
ReadData['diff']= ReadData.apply(lambda x: FindDifference(x),axis=1)

edited Jul 23 '19 at 09:14

answered Jul 22 '19 at 18:32

Amit

2,018
1
8
12

Can I capture only changes that happened in column2 coz I am having a situation where if column2 is nan, it picks column1 as the change – Livingstone Jul 23 '19 at 05:36
@Livingstone Do you want that if column 2 is nan there should be no output in the diff column? – Amit Jul 23 '19 at 05:41
Yes that is what I want – Livingstone Jul 23 '19 at 05:42
@Livingstone I have edited the answer to suit your requirement. Please let me know if you face any problems. – Amit Jul 23 '19 at 05:53
It has raised this error "("'str' object has no attribute 'isna'", 'occurred at index 0')" – Livingstone Jul 23 '19 at 05:57
In place where column2 doesnt has anything, its is nan, say first column had a value as dx-dy7 then column2 will have nan if there was no modification. I have edited the question above on the table just to show how it looks like and the result – Livingstone Jul 23 '19 at 05:59
@Livingstone I have edited it again. I am away from desktop, so I have not tested the code. Please try it and see you get what you want. – Amit Jul 23 '19 at 06:15
Alright...Let me keep testing then get back.Thank you – Livingstone Jul 23 '19 at 06:17
its still returning first column value in diff column instead of nan – Livingstone Jul 23 '19 at 08:09
Is it possible for you to paste a few rows of the DF after using the code as suggested above. – Amit Jul 23 '19 at 08:34
Yes let me edit the table above and paste some rows...with expected results on diff row – Livingstone Jul 23 '19 at 08:36
I have edited the table above from row 5 to 7 is a sample of what I do have here. whenever there is nan on column2 it means there was no difference between the two columns but where there is NA that means that cell was empty – Livingstone Jul 23 '19 at 08:45
@Livingstone It is a little strange for NaN to appear in smaller case as nan. Also NA is there. Were these values typed manually? You can check for it by opening the file and see if nan and NA are there. If yes then I will have to modify the answer slightly. – Amit Jul 23 '19 at 08:51
I have uploaded the sample datafile to github. take a look at how it is. The file have old columns and new columns..New columns have a prefix NEW_at beginning here is the link https://github.com/livingstone90/Compare-CSV/blob/master/DiffbCol.csv . Please confirm is you are able to see it.So i compare the old and new column and create new diff column for each pair – Livingstone Jul 23 '19 at 09:07
@Livingstone I could not see column1 or column2. Regardless , I have assumed there is nan and NA in the dataset. I have edited the code. – Amit Jul 23 '19 at 09:16
This is really eating our head...So still its getting the value from column1 where column2 has nan..what if you try that dataset using two columns that I am using here say Branching Logic column as old and New_Branching logic as New column.You will get what I am getting here – Livingstone Jul 23 '19 at 09:23

Compare two column values and find distinct unique values causing differences

2 Answers2