I would like to delete rows where column a has duplicate values, based on the number of NAs in other columns. This is similar to the question below, but I can't get the NA counting to work with the solutions there.

Deleting rows that are duplicated in one column based on the conditions of another column

Here is my toy dataset:

df1 <- data.frame(a = c("x","y","y","z","x", "z"), b = c(1,2,NA,4,8,3), c = c(NA,2,2,NA,NA,4), d= c(1:4,NA,NA))

which gives:

  a  b  c  d
1 x  1 NA  1
2 y  2  2  2
3 y NA  2  3
4 z  4 NA  4
5 x  8 NA NA
6 z  3  4 NA

I would like to keep only one row per value of column a: the row with the fewest NAs in columns b and c (ignoring NAs in column d).

This is the code I came up with:

df1 %>%
 mutate(NAs= apply(is.na(cbind(b,c)), 1, sum)) %>%     
 group_by(a) %>%
 top_n(n=1, -NAs)

My problem is that top_n returns more than one row when there is a tie; in that case, I just want the first row. There is probably also a better way to select columns inside mutate than cbind, and I don't actually need the "NAs" variable I created with mutate. My desired output is this:

  a  b  c  d
  x  1 NA  1
  y  2  2  2
  z  3  4 NA
Esther

3 Answers

@markus suggested this could be an answer in its own right. Perhaps so: with dplyr it can help to keep the code short, as otherwise you often end up with quite verbose scripts.

However, the main part in my opinion is rowSums, as has already been highlighted.

df1 %>% 
arrange(a, rowSums(is.na(.[, c("b", "c")]))) %>% 
distinct(a, .keep_all = TRUE)

  a b  c  d
1 x 1 NA  1
2 y 2  2  2
3 z 3  4 NA

P.S. If speed is a concern, then I'd indeed try to use as few dplyr verbs as possible: this approach with only arrange and distinct is about 3x faster than approaches combining group_by, slice, top_n, filter, etc.
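The 3x figure can be sanity-checked with a rough timing sketch using only dplyr and base `system.time`; the replication factor and loop count below are arbitrary choices for illustration, not from the original answer:

```r
library(dplyr)

df1 <- data.frame(a = c("x","y","y","z","x","z"),
                  b = c(1,2,NA,4,8,3),
                  c = c(NA,2,2,NA,NA,4),
                  d = c(1:4,NA,NA))

# Blow the toy data up so the timings are measurable
big <- df1[rep(seq_len(nrow(df1)), 5000), ]

t_arrange <- system.time(for (i in 1:50) {
  big %>%
    arrange(a, rowSums(is.na(.[, c("b", "c")]))) %>%
    distinct(a, .keep_all = TRUE)
})

t_group <- system.time(for (i in 1:50) {
  big %>%
    mutate(NAs = rowSums(is.na(.[, c("b", "c")]))) %>%
    group_by(a) %>%
    top_n(n = 1, -NAs) %>%
    slice(1) %>%
    ungroup() %>%
    select(-NAs)
})

t_arrange["elapsed"]
t_group["elapsed"]
```

Relative timings will vary with data size and dplyr version, so treat the factor as indicative rather than exact.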

arg0naut91

Here is an option

library(dplyr)
df1 %>%
  mutate(NAs = rowSums(is.na(.[, c("b", "c")]))) %>%
  group_by(a) %>%
  top_n(n = 1, -NAs) %>% 
  slice(1) %>% 
  select(-NAs)
# A tibble: 3 x 4
# Groups:   a [3]
#  a         b     c     d
#  <fct> <dbl> <dbl> <int>
#1 x         1    NA     1
#2 y         2     2     2
#3 z         3     4    NA

rowSums is a more efficient alternative to apply(..., 1, sum).
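As a quick sanity check on the question's toy data, both calls produce the same per-row NA counts (a minimal sketch):

```r
df1 <- data.frame(a = c("x","y","y","z","x","z"),
                  b = c(1,2,NA,4,8,3),
                  c = c(NA,2,2,NA,NA,4),
                  d = c(1:4,NA,NA))

# Count NAs per row across columns b and c, two ways
via_rowsums <- rowSums(is.na(df1[, c("b", "c")]))
via_apply   <- apply(is.na(df1[, c("b", "c")]), 1, sum)

all(via_rowsums == via_apply)  # TRUE
```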


You might also try data.table. The solution below should be really fast (but potentially less readable).

library(data.table)
setDT(df1)
df1[df1[order(a, df1[, rowSums(is.na(.SD)), .SDcols = c("b", "c")]), .I[1], by = "a"]$V1]
#   a b  c  d
#1: x 1 NA  1
#2: y 2  2  2
#3: z 3  4 NA
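If the .SD/.I construction above feels dense, a shorter data.table variant of the same idea (order by the group key and NA count, then keep the first row per group) should give the same three rows; this is a sketch, not from the original answer:

```r
library(data.table)

df1 <- data.table(a = c("x","y","y","z","x","z"),
                  b = c(1,2,NA,4,8,3),
                  c = c(NA,2,2,NA,NA,4),
                  d = c(1:4,NA,NA))

# Order rows by a, then by the NA count over b and c (fewest first),
# and keep the first occurrence of each value of a
df1[order(a, rowSums(is.na(df1[, .(b, c)])))][!duplicated(a)]
```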
markus
This is not a different approach, but just if you'd like to keep it less verbose: `df1 %>% arrange(a, rowSums(is.na(.[, c("b", "c")]))) %>% distinct(a, .keep_all = TRUE)` – arg0naut91 Oct 28 '18 at 21:06

@arg0naut Very nice. Why don't you turn your comment into an answer? – markus Oct 28 '18 at 21:09

A slightly different dplyr option:

df1 %>%
  mutate(miss = rowSums(is.na(cbind(b,c)))) %>%
  group_by(a) %>%
  filter(miss == min(miss)) %>%
  slice(1) %>%
  select(-miss) %>%
  ungroup()

Or:

df1 %>%
  mutate(miss = rowSums(is.na(cbind(b,c)))) %>%
  group_by(a) %>%
  mutate(dupl = seq_along(a)) %>%
  filter(miss == min(miss)) %>%
  filter(dupl == min(dupl)) %>%
  select(-miss, -dupl) %>%
  ungroup()
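For completeness, the same order-then-deduplicate idea also works in plain base R with no packages; this is a sketch on the question's data, not from the answers above:

```r
df1 <- data.frame(a = c("x","y","y","z","x","z"),
                  b = c(1,2,NA,4,8,3),
                  c = c(NA,2,2,NA,NA,4),
                  d = c(1:4,NA,NA))

# Sort by a, then by the NA count in columns b and c (fewest first)
ord <- order(df1$a, rowSums(is.na(df1[, c("b", "c")])))
srt <- df1[ord, ]

# duplicated() keeps the first (fewest-NA) row per value of a
srt[!duplicated(srt$a), ]
```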
tmfmnk