Remove Second Duplicated Row

Question

I need to merge two data frames: Complete10 and INSIDECostaRica. Inside Costa Rica contains a subset of the data in Complete10 that is the same but with some specific RICHNESS values changed from 0 to 0.5 (so that I can filter by these later). I tried merging them several ways, but it seems like the only way to keep my rows with 0.5 is to use all=TRUE, which means I end up with duplicates of these rows, a 0 row and a 0.5 row. I would like to use the distinct function to remove the rows with 0 values that are otherwise duplicated in a 0.5 row, but I think that will delete my 0.5 rows because my merge attached these values to the end of my dataset. For example, this is what my merge looked like, and this is what I have after the merge:

MergeTest<-merge(Complete10,INSIDECostaRica,by=c("GRID","x.centroids","y.centroids","RICHNESS"),all=TRUE)

GRID   LONG   LAT   RICHNESS
75      5      6     0
75      5      6     0.5
76      8      4     12
77      2      7     0
78      4      3     0
78      4      3     0.5

And this is what I would like to have:

GRID   LONG   LAT   RICHNESS
75      5      6     0.5
76      8      4     12
77      2      7     0
78      4      3     0.5

I don't want to remove all of the zeros from RICHNESS, just those that are duplicates of 0.5.

Please provide `dput()` for your actual data, or just a subset of your data, that can be used to reproduce the output you've posted. — andrew_reece, Dec 14 '20 at 04:41
I would recommend this post in order to make a reproducible example: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — Marcio Rodrigues, Dec 14 '20 at 04:42

AnilGoyal · Accepted Answer · 2020-12-14T05:18:33.660

This would do the job

library(tidyverse)

MergeTest %>% group_by(GRID, LONG, LAT) %>% arrange(RICHNESS) %>% slice_tail()

# A tibble: 4 x 4
# Groups:   GRID, LONG, LAT [4]
   GRID  LONG   LAT RICHNESS
  <int> <int> <int>    <dbl>
1    75     5     6      0.5
2    76     8     4     12  
3    77     2     7      0  
4    78     4     3      0.5

OR

MergeTest %>% group_by(GRID, LONG, LAT) %>% filter(!(n()>1 & RICHNESS == 0))

# A tibble: 4 x 4
# Groups:   GRID, LONG, LAT [4]
   GRID  LONG   LAT RICHNESS
  <int> <int> <int>    <dbl>
1    75     5     6      0.5
2    76     8     4     12  
3    77     2     7      0  
4    78     4     3      0.5

sample dput

> dput(MergeTest)
structure(list(GRID = c(75L, 75L, 76L, 77L, 78L, 78L), LONG = c(5L, 
5L, 8L, 2L, 4L, 4L), LAT = c(6L, 6L, 4L, 7L, 3L, 3L), RICHNESS = c(0, 
0.5, 12, 0, 0, 0.5)), class = "data.frame", row.names = c(NA, 
-6L))

This separated out the highest values, but I need to keep the zeros that aren't duplicates and remove the zeros that are. When I sort this way, it puts all the zeros together in grid order which means I can't distinguish the ones to keep from the ones that need to be removed. — ElizaBeso000, Dec 14 '20 at 05:04

Ronak Shah · Answer 2 · 2020-12-14T04:52:30.990

0

One way after merging would be to keep the maximum value of RICHNESS for each group.

library(dplyr)
result <- MergeTest %>%
            group_by(GRID, LONG, LAT) %>%
            slice(which.max(RICHNESS))

I think you can fix your merge step if you remove 'RICHNESS' column from Complete10 if you want to keep it's value only from INSIDECostaRica.

MergeTest<-merge(Complete10[setdiff(names(Complete10), 'RICHNESS')],
                INSIDECostaRica,by=c("GRID","x.centroids","y.centroids"))

edited Dec 14 '20 at 04:52

answered Dec 14 '20 at 04:45

Ronak Shah

377,200
20
156
213

Unfortunately, I also need the values that aren't in INSIDECostaRica but are in Complete10 – ElizaBeso000 Dec 14 '20 at 05:16
@ElizaBeso000 Did you try adding `all = TRUE` in `merge` ? – Ronak Shah Dec 14 '20 at 05:36

Remove Second Duplicated Row

2 Answers2