The goal is to extract rows from a dataframe/data.table which have:
- same values in two or more fields (here NAME and DOB); but
- different values in another field (here ID)
Currently I'm doing this:
library(data.table)
# load the data
customers <- structure(list(
NAME = c("GEETA SHYAM RAO", "B V RAMANA", "GONTU VENKATARAMANAIAH",
"DAMAT RAMAKRISHNA", "MARIAM SUDHAKAR", "VELPURI LAKSHMI SUJATHA",
"MOHAMMED LIYAKHAT ALI", "VENKATESHWARAN PONNAMBALAM",
"DEVARAKONDA SATISH BABU", "GEEDI RAMULU", "KANDU OBULESU",
"J PARVATHALU(TEMP.SUB-STAFF)", "DOKKA RAJESH", "G TULASIRAM REDDY",
"MALLELA CHIRANJEEVI", "MANEPALLI VENKATA RAVAMMA",
"DOKKA JAGADEESHWAR", "K KRISHNA", "B SUDARSHAN", "B ANNAPURNA",
"CHAVVA SHIVA RAMULU", "BIKASH BAHADUR CHITRE", "DARBAR ASHOK",
"VEMULAPALLY SANGAMESHWAR RAO", "MOHAMMED ABDUL HAKEEM ANWAR",
"MANEPALLI SHIV SHANKAR RAO", "MOHD MISKEEN MOHIUDDIN",
"KOTLA CHENNAMMA", "NAYAK SURYAKANTH", "GOPIREDDY INDIRA",
"MEKALA SREEDEVI", "K KRISHNA", "B V RAMANA", "KUMMARI VENKATESHAM",
"BHAVANI CONSRUCTIONS", "UPPUTHOLLA KOTAIAH", "YEDIDHA NIRMALA DEVI",
"MARIAM SUDHAKAR", "B ANNAPURNA", "VELPURI LAKSHMI SUJATHA",
"DARBAR ASHOK", "AMMANA VISHNU VARDHAN REDDY", "ZAITOON BEE",
"MOHD CHAND PASHA", "PALERELLA RAMESH", "GEEDI SRINIVAS",
"RAMAIAH SADU", "BIMAN BALAIAH", "KOTLA CHENNAMMA",
"VENKATESHWARAN PONNAMBALAM"),
DOB = c("13-02-1971", "15-01-1960", "01-07-1970", "10-03-1977",
"24-01-1954", "28-06-1971", "26-01-1980", "14-04-1969", "23-09-1978",
"15-08-1954", "09-10-1984", "20-02-1975", "29-09-1984", "03-03-1975",
"26-01-1979", "01-01-1964", "21-01-1954", "01-05-1964", "12-03-1975",
"12-12-1962", "10-03-1982", "14-05-1983", "03-01-1950", "04-03-1962",
"12-05-1966", "01-06-1960", "10-03-1964", "15-07-1958", "26-06-1979",
"02-04-1974", "10-01-1975", "01-05-1964", "15-01-1960", "08-08-1977",
NA, "05-04-1981", "29-08-1971", "24-01-1954", "12-12-1962",
"28-06-1971", "03-01-1950", "23-06-1970", "20-02-1960", "05-07-1975",
"10-01-1979", "31-08-1982", "10-08-1983", "10-03-1964",
"15-07-1958", "14-04-1969"),
ID = c(502969, 502902, 502985, 502981, 502475, 502267, 502976,
502272, 502977, 502973, 502986, 502978, 502989, 502998, 502967,
502971, 502988, 502737, 502995, 502878, 502972, 502984, 502639,
502968, 502975, 502970, 502997, 502466, 502991, 502982, 502980,
502737, 502902, 502999, 502994, 502987, 502990, 502047, 502877,
502251, 502548, 502992, 503000, 502993, 502983, 502974, 502996,
502979, 502467, 502290),
PIN = c(500082, 500032, 500032, 500032,
500032, 500084, 500032, 500032, 500032, 500032, 500032, 500084,
500032, 500084, 500084, 500032, 5e+05, 500050, 500032, 500084,
500032, 500032, 500032, 500050, 500032, 500032, 500045, 500032,
500084, 500032, 500032, 500084, 500035, 500084, 500032, 500032,
500032, 500032, 500084, 500032, 500084, 500033, 500084, 500032,
500032, 500032, 500084, 500032, 500032, 500032)),
.Names = c("NAME", "DOB", "ID", "PIN"),
class = c("data.table", "data.frame"), row.names = c(NA,-50L))
Checkout the data:
dim(customers)
#[1] 50 4
head(customers)
NAME DOB ID PIN
#1: GEETA SHYAM RAO 13-02-1971 502969 500082
#2: B V RAMANA 15-01-1960 502902 500032
#3: GONTU VENKATARAMANAIAH 01-07-1970 502985 500032
#4: DAMAT RAMAKRISHNA 10-03-1977 502981 500032
#5: MARIAM SUDHAKAR 24-01-1954 502475 500032
#6: VELPURI LAKSHMI SUJATHA 28-06-1971 502267 500084
Step1: Get rows with same values in NAME and DOB columns -
dup1 <- customers[, .(ID, PIN, .N), keyby=.(NAME, DOB)][N>1][, -"N"]
dup1
# NAME DOB ID PIN
# 1: B ANNAPURNA 12-12-1962 502878 500084
# 2: B ANNAPURNA 12-12-1962 502877 500084
# 3: B V RAMANA 15-01-1960 502902 500032
# 4: B V RAMANA 15-01-1960 502902 500035
# 5: DARBAR ASHOK 03-01-1950 502639 500032
# 6: DARBAR ASHOK 03-01-1950 502548 500084
# 7: K KRISHNA 01-05-1964 502737 500050
# 8: K KRISHNA 01-05-1964 502737 500084
# 9: KOTLA CHENNAMMA 15-07-1958 502466 500032
#10: KOTLA CHENNAMMA 15-07-1958 502467 500032
#11: MARIAM SUDHAKAR 24-01-1954 502475 500032
#12: MARIAM SUDHAKAR 24-01-1954 502047 500032
#13: VELPURI LAKSHMI SUJATHA 28-06-1971 502267 500084
#14: VELPURI LAKSHMI SUJATHA 28-06-1971 502251 500032
#15: VENKATESHWARAN PONNAMBALAM 14-04-1969 502272 500032
#16: VENKATESHWARAN PONNAMBALAM 14-04-1969 502290 500032
In the above result, the ID values of "B V RAMANA" and "K KRISHNA" are same in their duplicate rows and therefore need to be removed.
Step 2: Get rows with same values in NAME, DOB and ID columns -
dup2 <- dup1[, .(PIN, .N), keyby=.(NAME, DOB, ID)][N>1][, -"N"]
dup2
# NAME DOB ID PIN
#1: B V RAMANA 15-01-1960 502902 500032
#2: B V RAMANA 15-01-1960 502902 500035
#3: K KRISHNA 01-05-1964 502737 500050
#4: K KRISHNA 01-05-1964 502737 500084
Step 3: Now remove rows in Step 2 from rows in Step 1 to get the final result -
result <- fsetdiff(dup1, dup2)
result
# NAME DOB ID PIN
# 1: B ANNAPURNA 12-12-1962 502878 500084
# 2: B ANNAPURNA 12-12-1962 502877 500084
# 3: DARBAR ASHOK 03-01-1950 502639 500032
# 4: DARBAR ASHOK 03-01-1950 502548 500084
# 5: KOTLA CHENNAMMA 15-07-1958 502466 500032
# 6: KOTLA CHENNAMMA 15-07-1958 502467 500032
# 7: MARIAM SUDHAKAR 24-01-1954 502475 500032
# 8: MARIAM SUDHAKAR 24-01-1954 502047 500032
# 9: VELPURI LAKSHMI SUJATHA 28-06-1971 502267 500084
#10: VELPURI LAKSHMI SUJATHA 28-06-1971 502251 500032
#11: VENKATESHWARAN PONNAMBALAM 14-04-1969 502272 500032
#12: VENKATESHWARAN PONNAMBALAM 14-04-1969 502290 500032
In every case above, the NAME and DOB columns have duplicate values but the values in ID column for those duplicate rows are necessarily different.
That's three lines of processing code for getting the result but I'm sure there must be alternative methods. In this example, there are just four fields. With say, more than 50 fields it would be a tedious job to put all the field names in code even with copy-paste. Therefore, it would be really cool to create a re-usable function which could take as input -
- a dataframe/data.table
- a vector of just the fieldnames which must contain duplicate values
- a single fieldname which must contain different values
and output the result as a dataframe/data.table. Ideas please.
Side note: This feature is deemed so important in fraud analytics that a commercial software "CaseWare IDEA" offers it by the name "Duplicate Key Exclusion". Checkout this feature in action: https://www.youtube.com/watch?v=XqL4j8UXsKw