Name Address Account    a   b      Amount   Phone
John CA     4879759  qwqe   rerter  203     807789747
Nil  FD     1234455  iuyui  jhgjhg  4321    98797897
Was  FR     8979696  yikjh  kkjhk   45989   9899999
Nil  FD     1234455  iuyui  jhgjhg  4321    98797897
John CA     4879759  qwqe   rerter  203     807789747
Saw  PO     9873279  kjljl  bjhjh   765     3543656
Nil  FD     1234455  iuyui  jhgjhg  4321    98797897
Aws  IL     707009   dfdsf  sasd    2344    242545
John CA     4879759  qwqe   rerter  203     807789747

I want to pull the duplicate rows out of this table using R. The table is named "Loan" and has 17 billion line items. The key columns are "Name, Address, Account, Amount, Phone". I am looking forward to a working solution.

One more thing: after separating them, I want to save the table of duplicate rows in .csv format. I am new to R, so please help me with this as well.

Theking
    See [here](http://stackoverflow.com/questions/25041933), [here](http://stackoverflow.com/questions/22959635), [here](http://stackoverflow.com/questions/26703764), [here](http://stackoverflow.com/questions/12495345), [here](http://stackoverflow.com/questions/31933605), [here](http://stackoverflow.com/questions/13967063), [here](http://stackoverflow.com/questions/24881855/delete-all-duplicated-rows-in-r), and [here](http://stackoverflow.com/search?q=%5Br%5D+duplicated+rows), some links might be duplicated – zx8754 Nov 30 '15 at 10:39

2 Answers

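We can use duplicated() to get all the duplicate rows based on the key columns ('nm1'). Combining it with a second call that sets fromLast=TRUE also flags the first occurrence of each duplicated row. Here df1 stands for the Loan table.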

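# Key columns that define a duplicate
nm1 <- c("Name", "Address", "Account", "Amount", "Phone")
# Keep rows whose key repeats, scanning from the top or from the bottom
df1[duplicated(df1[nm1]) | duplicated(df1[nm1], fromLast = TRUE), ]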
# Name Address Account     a      b Amount     Phone
#1 John      CA 4879759  qwqe rerter    203 807789747
#2  Nil      FD 1234455 iuyui jhgjhg   4321  98797897
#4  Nil      FD 1234455 iuyui jhgjhg   4321  98797897
#5 John      CA 4879759  qwqe rerter    203 807789747
#7  Nil      FD 1234455 iuyui jhgjhg   4321  98797897
#9 John      CA 4879759  qwqe rerter    203 807789747
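
The snippet above does not show how df1 is built. A minimal reproducible sketch, assuming the sample rows from the question are read in with read.table (the names is_dup and dups are only illustrative, not part of the original answer):

```r
# Sample rows from the question; on real data df1 would be the Loan table
df1 <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "Name Address Account a b Amount Phone
John CA 4879759 qwqe rerter 203 807789747
Nil FD 1234455 iuyui jhgjhg 4321 98797897
Was FR 8979696 yikjh kkjhk 45989 9899999
Nil FD 1234455 iuyui jhgjhg 4321 98797897
John CA 4879759 qwqe rerter 203 807789747
Saw PO 9873279 kjljl bjhjh 765 3543656
Nil FD 1234455 iuyui jhgjhg 4321 98797897
Aws IL 707009 dfdsf sasd 2344 242545
John CA 4879759 qwqe rerter 203 807789747")

# Key columns that define a duplicate
nm1 <- c("Name", "Address", "Account", "Amount", "Phone")

# Flag a row if its key appears again earlier OR later in the table,
# so the first occurrence of each duplicated key is kept as well
key    <- df1[nm1]
is_dup <- duplicated(key) | duplicated(key, fromLast = TRUE)

# Hypothetical name for the result
dups <- df1[is_dup, ]
print(dups)
```

Dropping the fromLast = TRUE part would leave out the first occurrence of each key (rows 1 and 2 here).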
akrun

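A variation on akrun's answer that runs the duplication check on the key columns only. With a single duplicated() call, only the second and later occurrences of each key are flagged, so the first occurrence of each duplicated row is left out: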

mainCols <- c("Name", "Address", "Account", "Amount", "Phone")
# TRUE for every row whose key columns already appeared in an earlier row
duplicatedRows <- duplicated(loan[, mainCols])
duplicatedData <- loan[duplicatedRows, ]

# Name Address Account     a      b Amount     Phone
# 4  Nil      FD 1234455 iuyui jhgjhg   4321  98797897
# 5 John      CA 4879759  qwqe rerter    203 807789747
# 7  Nil      FD 1234455 iuyui jhgjhg   4321  98797897
# 9 John      CA 4879759  qwqe rerter    203 807789747
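
The question also asks about saving the duplicates as a .csv file. A minimal sketch using write.csv, assuming the data fits in memory as a data frame. The file name loan_duplicates.csv is a hypothetical choice, and the sample rows below only stand in for the real Loan table:

```r
# Sample rows standing in for the real Loan table (assumption for illustration)
loan <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "Name Address Account a b Amount Phone
John CA 4879759 qwqe rerter 203 807789747
Nil FD 1234455 iuyui jhgjhg 4321 98797897
Was FR 8979696 yikjh kkjhk 45989 9899999
Nil FD 1234455 iuyui jhgjhg 4321 98797897
John CA 4879759 qwqe rerter 203 807789747
Saw PO 9873279 kjljl bjhjh 765 3543656
Nil FD 1234455 iuyui jhgjhg 4321 98797897
Aws IL 707009 dfdsf sasd 2344 242545
John CA 4879759 qwqe rerter 203 807789747")

mainCols <- c("Name", "Address", "Account", "Amount", "Phone")

# Rows whose key columns already appeared in an earlier row
duplicatedData <- loan[duplicated(loan[, mainCols]), ]

# Hypothetical output file, written to the current working directory
out_file <- "loan_duplicates.csv"

# row.names = FALSE keeps R's row numbers out of the file
write.csv(duplicatedData, file = out_file, row.names = FALSE)

# Read the file back as a quick sanity check
check <- read.csv(out_file, stringsAsFactors = FALSE)
print(nrow(check))
```

With 17 billion rows the table may well not fit in memory as an ordinary data frame, so this is probably only practical on a subset of the data or on a machine with enough RAM.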