
I am currently working with a large data set looking at duplicate water rights. Each right holder is assigned a RightID, but some were recorded twice for clerical purposes. However, some RightIDs are listed more than once and do have relevance to my end goal. One example: there are double entries when a metal tag number was assigned to a specific water right. To avoid double counting the critical information, I need to delete one of the observations.

Here is what I have written at the moment:

# Updated Metal Tag Number
# Note: this only compares each row to the row directly above it
rows_to_drop <- integer(0)
for (i in 2:nrow(duplicate.rights)) {
  if (duplicate.rights[i, "RightID"] == duplicate.rights[i - 1, "RightID"] &&
      duplicate.rights[i, "MetalTagNu"] != duplicate.rights[i - 1, "MetalTagNu"]) {
    rows_to_drop <- c(rows_to_drop, i)
  }
  print(i)
}
if (length(rows_to_drop) > 0) duplicate.rights <- duplicate.rights[-rows_to_drop, ]

The original data frame is set up similarly:

RightID    Source        Use           MetalTagNu
1-0000     Wolf Creek    Irrigation    N/A
1-0000     Wolf Creek    Irrigation    12345
1-0001     Bear River    Domestic      N/A
1-0002     Beaver Stream Domestic      00001
1-0002     Beaver Stream Irrigation    00001
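
For a minimal reproducible example, the data frame above can be rebuilt like this (MetalTagNu is stored as character so the "N/A" entries and the leading zeros are preserved):

duplicate.rights <- data.frame(
  RightID    = c("1-0000", "1-0000", "1-0001", "1-0002", "1-0002"),
  Source     = c("Wolf Creek", "Wolf Creek", "Bear River", "Beaver Stream", "Beaver Stream"),
  Use        = c("Irrigation", "Irrigation", "Domestic", "Domestic", "Irrigation"),
  MetalTagNu = c("N/A", "12345", "N/A", "00001", "00001"),
  stringsAsFactors = FALSE
)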

E.g. right holder 1-0002 needs to be kept because he is using his water right for two different purposes. However, the second entry for right holder 1-0000 is an unnecessary repeat.

I need to eliminate the repeat for right holder 1-0000, but right holder 1-0002 is valuable to my end goal. I should also note that there can be up to 10 entries for a single RightID, but out of those 10 only 1 is an unnecessary duplicate. Also, the duplicate and the original entry will not be next to each other in the dataset.

I am quite the novice, so please forgive my poor previous attempt. I know I can use the lapply function to make this faster and more efficient. Any guidance there would be much appreciated.

DPek
  • please provide a minimal working example showing the dataframe – rgunning May 23 '17 at 16:38
  • Removing duplicates has been answered here: https://stackoverflow.com/questions/13967063/remove-duplicated-rows – Greg Hardin May 23 '17 at 16:57

2 Answers


So I would suggest the following:

1) You say that you want to keep some duplicates (those where a metal tag number was assigned to a specific water right). I am not sure exactly what this means, but I assume it is something like this: if the metal tag number equals 1, then you want to keep those rows even if they are duplicates. So I propose that you first split these rows out of your data (let's call it data):

data_to_keep   <- data[data$metal_tag_number == 1, ]   # rows to keep even if duplicated
data_to_dedupe <- data[data$metal_tag_number != 1, ]   # rows that still need de-duplication

2) Now that you have the two dataframes, you can dedupe the dataframe data_to_dedupe with no problem:

# dedupe_key stands in for whichever column identifies a duplicate (e.g. RightID)
deduped_data <- data_to_dedupe[!duplicated(data_to_dedupe$dedupe_key), ]

3) Now you can merge the two dataframes back together:

final_data <- rbind(data_to_keep, deduped_data)
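
To see the three steps end to end, here is a small self-contained example; metal_tag_number and dedupe_key are just the placeholder names used above, not the question's actual columns:

# Toy data: key "A" has one tagged row that must always be kept
data <- data.frame(
  dedupe_key       = c("A", "A", "B", "B"),
  metal_tag_number = c(1, 0, 0, 0)
)

data_to_keep   <- data[data$metal_tag_number == 1, ]
data_to_dedupe <- data[data$metal_tag_number != 1, ]
deduped_data   <- data_to_dedupe[!duplicated(data_to_dedupe$dedupe_key), ]
final_data     <- rbind(data_to_keep, deduped_data)

final_data contains the tagged "A" row, the first untagged "A" row, and a single "B" row.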

If this is what you wanted, please upvote and accept the answer. Thanks!


Create a new column, key, which is a combination of RightID & Use.

Assuming your dataframe is called df,

df$key <- paste(df$RightID,df$Use)

Then, remove duplicates using this command:

df1 <- df[!duplicated(df$key), ]

df1 will have no duplicated RightID/Use combinations.
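
For illustration, here is how this looks on the data frame from the question (a sketch assuming the values shown there, with MetalTagNu kept as character):

df <- data.frame(
  RightID    = c("1-0000", "1-0000", "1-0001", "1-0002", "1-0002"),
  Use        = c("Irrigation", "Irrigation", "Domestic", "Domestic", "Irrigation"),
  MetalTagNu = c("N/A", "12345", "N/A", "00001", "00001"),
  stringsAsFactors = FALSE
)

df$key <- paste(df$RightID, df$Use)
df1 <- df[!duplicated(df$key), ]

df1 keeps one 1-0000 Irrigation row, the 1-0001 Domestic row, and both 1-0002 rows. Keep in mind that duplicated() keeps the first occurrence it sees, so the row order decides which 1-0000 copy survives; ordering the data so the row with the real tag number comes first would retain the 12345 entry.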