1

Using distinct to remove duplicates within a combined dataset however Im losing data because distinct only keeps the first entry.

Example data frame "a"

 SiteID PYear   Habitat num.1
000901W 2011    W   NA
001101W 2007    W   NA
001801W 2005    W   NA
002001W 2017    W   NA
002401F 2006    F   NA
002401F 2016    F   NA
004001F 2006    F   NA
004001W 2006    W   NA
004101W 2007    W   NA
004101W 2007    W   16
004701F 2017    F   NA
006201F 2008    F   NA
006501F 2009    F   NA
006601W 2007    W   2
006601W 2007    W   NA
006803F 2009    F   NA
007310F 2018    F   NA
007602W 2017    W   NA
008103W 2011    W   NA
008203F 2007    F   1

Coding:

a<-distinct(a,SiteID, .keep_all = TRUE)

I would like to know how to remove duplicates based on SiteID and num.1 removing duplicates however I dont want to get rid of duplicates that have number values in the num.1 column. For example, in the dataframe a 004101W and 006601W have multiple entries but I want to keep the integer rather than the NA.

  • In that code, you specifically specified that you only want to consider the `SiteID` column when determining distinctness. Just use `distinct(a)` to remove rows that are duplicated only in *all* columns – divibisan Mar 19 '19 at 20:25
  • Possible duplicate of [Remove duplicated rows](https://stackoverflow.com/questions/13967063/remove-duplicated-rows) – divibisan Mar 19 '19 at 20:26
  • Thank you but Ive only given a very simple example dataframe, my actual dataframe has many more columns and if I do what your saying then it wont get rid of the duplicates. –  Mar 19 '19 at 20:30
  • Then provide a [mcve] that replicates your problem. If you read `?distinct` is says that if you pass in variable names (all arguments after the data frame itself) then it will only look at those columns to determine uniqueness, otherwise it will use all of them. So it's just a question of putting the right variable names into `distinct` – divibisan Mar 19 '19 at 20:31
  • Its not a duplicate of the post you reference b/c I need to remove duplicates but based on a condition within one column. –  Mar 19 '19 at 20:32
  • 1
    But that's not what you say in your question. Please [edit] your question so it includes a [mcve] that reproduces your problem. – divibisan Mar 19 '19 at 20:33

1 Answers1

1

(Thank you for updating with more representative sample data!)

a now has 20 rows, with 17 different SiteID values.

Three of those SiteIDs have multiple rows:

library(tidyverse)
a %>% 
  add_count(SiteID) %>%
  filter(n > 1)

## A tibble: 6 x 5
#  SiteID  PYear Habitat num.1     n
#  <chr>   <int> <chr>   <int> <int>
#1 002401F  2006 F          NA     2    # Both have NA for num.1
#2 002401F  2016 F          NA     2    #  ""

#3 004101W  2007 W          NA     2    # Drop 
#4 004101W  2007 W          16     2    # Keep this one

#5 006601W  2007 W           2     2    # Keep this one
#6 006601W  2007 W          NA     2    # Drop

If we want to prioritize the rows without NA in num.1, we can arrange by num.1 within each SiteID, such that NAs come last for each SiteID, and the distinct function will prioritize num.1's with a non-NA value.

(An alternative is also provided in case you want to keep the original sorting in a, but still moving NA values in num.1 to the end. In the is.na(num.1) term, NA's will evaluate as TRUE and will come after provided values, which will evaluate as FALSE.)

a %>% 
  arrange(SiteID, num.1) %>%
  #arrange(SiteID, is.na(num.1)) %>%    # Alternative to preserve orig order
  distinct(SiteID, .keep_all = TRUE)

    SiteID PYear Habitat num.1
1  000901W  2011       W    NA
2  001101W  2007       W    NA
3  001801W  2005       W    NA
4  002001W  2017       W    NA
5  002401F  2006       F    NA     # Kept first appearing row, since both NA num.1
6  004001F  2006       F    NA
7  004001W  2006       W    NA
8  004101W  2007       W    16     # Kept non-NA row
9  004701F  2017       F    NA
10 006201F  2008       F    NA
11 006501F  2009       F    NA
12 006601W  2007       W     2     # Kept non-NA row
13 006803F  2009       F    NA
14 007310F  2018       F    NA
15 007602W  2017       W    NA
16 008103W  2011       W    NA
17 008203F  2007       F     1

Import of sample data

a <- read.table(header = T, stringsAsFactors = F,
  text = " SiteID PYear   Habitat num.1
000901W 2011    W   NA
001101W 2007    W   NA
001801W 2005    W   NA
002001W 2017    W   NA
002401F 2006    F   NA
002401F 2016    F   NA
004001F 2006    F   NA
004001W 2006    W   NA
004101W 2007    W   NA
004101W 2007    W   16
004701F 2017    F   NA
006201F 2008    F   NA
006501F 2009    F   NA
006601W 2007    W   2
006601W 2007    W   NA
006803F 2009    F   NA
007310F 2018    F   NA
007602W 2017    W   NA
008103W 2011    W   NA
008203F 2007    F   1")
Jon Spring
  • 55,165
  • 4
  • 35
  • 53