Remove exact rows and frequency of rows of a data.frame that are in another data.frame in R

Question

Consider the following two data.frames:

a1 <- data.frame(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)])
a2 <- data.frame(A = c(1:3,2), B = letters[c(1:3,2)])

I would like to remove the exact rows of a1 that are in a2 so that the result should be:

A  B
4  d
5  e
4  d
2  b

Note that one row with 2 b in a1 is retained in the final result. Currently, I use a looping statement, which becomes extremely slow as I have many variables and thousands of rows in my data.frames. Is there any built-in function to get this result?

It isn't clear that your output is correct: `2b` is in both to start with. Am I missing something? — steveb, Oct 10 '17 at 01:52
@steveb `2b` is twice in `a1`, so only one gets cancelled and one remains in the output. — Ronak Shah, Oct 10 '17 at 01:56
I think my answer does what you want. Agreed, it is hard to simplify. — steveb, Oct 10 '17 at 02:20

DWal · Answer 1 · 2017-10-10T04:15:16.947

The idea is to add a duplicate counter to each table, so that every occurrence of a row gets a unique match. data.table is convenient here because it makes counting duplicates easy (with .N) and also provides the set-operation function that is needed (fsetdiff).

library(data.table)

a1 <- data.table(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)])
a2 <- data.table(A = c(1:3,2), B = letters[c(1:3,2)])

# add a per-group occurrence counter for duplicate rows
a1[, i := seq_len(.N), by = .(A, B)]
a2[, i := seq_len(.N), by = .(A, B)]

# fsetdiff returns the rows of a1 that have no match in a2
# "all = TRUE" allows duplicate rows to be returned
fsetdiff(a1, a2, all = TRUE)

#    A B i
# 1: 4 d 1
# 2: 5 e 1
# 3: 4 d 2
# 4: 2 b 3
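
Note that `:=` adds the counter column `i` to `a1` and `a2` by reference, and the result still carries `i`. A minimal sketch, assuming you want to leave the inputs untouched and get back only the original columns (the names `a1i`, `a2i` and `res` are just for illustration):

```r
library(data.table)

a1 <- data.table(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)])
a2 <- data.table(A = c(1:3, 2), B = letters[c(1:3, 2)])

# Work on copies so that a1 and a2 do not gain the helper column
a1i <- copy(a1)[, i := seq_len(.N), by = .(A, B)]
a2i <- copy(a2)[, i := seq_len(.N), by = .(A, B)]

# Same set difference as above, then drop the helper column
res <- fsetdiff(a1i, a2i, all = TRUE)[, i := NULL]
print(res)
```

After this, `a1` and `a2` still have only the columns A and B.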

steveb · Answer 2 · 2017-10-10T14:59:16.783

You could use dplyr to do this. I set stringsAsFactors = FALSE to get rid of warnings about factor mismatches.

library(dplyr)

a1 <- data.frame(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)], stringsAsFactors = FALSE)
a2 <- data.frame(A = c(1:3,2), B = letters[c(1:3,2)], stringsAsFactors = FALSE)

## Make temp variables to join on, then delete them later.
# Number each occurrence of an (A, B) row in a1
a1_tmp <-
    a1 %>%
    group_by(A, B) %>%
    mutate(tmp_id = row_number()) %>%
    ungroup()
# Count how often each (A, B) row occurs in a2
a2_tmp <-
    a2 %>%
    group_by(A, B) %>%
    summarise(count = n()) %>%
    ungroup()

## Keep all rows that have no entry in a2 or whose id > the count (i.e. the a2 entries are used up).
left_join(a1_tmp, a2_tmp, by = c('A', 'B')) %>%
    filter(is.na(count) | tmp_id > count) %>%
    select(-tmp_id, -count)

## # A tibble: 4 x 2
##       A     B
##   <dbl> <chr>
## 1     4     d
## 2     5     e
## 3     4     d
## 4     2     b

EDIT

Here is a similar, slightly shorter solution. It does the following: (1) adds a row-number column to both data.frames and joins on it together with A and B; (2) adds a temporary column to a2 (the second data.frame) that comes out as NA in the join for rows of a1 with no match (i.e. rows that are unique to a1).

library(dplyr)

left_join(a1 %>% group_by(A,B) %>% mutate(rn = row_number())             %>% ungroup(),
          a2 %>% group_by(A,B) %>% mutate(rn = row_number(), tmpcol = 0) %>% ungroup(),
          by = c('A', 'B', 'rn')) %>%
filter(is.na(tmpcol)) %>%
select(-tmpcol, -rn)

## # A tibble: 4 x 2
##       A     B
##   <dbl> <chr>
## 1     4     d
## 2     5     e
## 3     4     d
## 4     2     b

I think this solution is a little simpler (perhaps very little) than the first.
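
The join-and-filter step can also be written with dplyr's `anti_join`, which keeps the rows of its first argument that have no match in the second, so no temporary column is needed. This is a minimal sketch on the same `a1` and `a2`; the helper name `add_rn` is hypothetical and only used here for illustration:

```r
library(dplyr)

a1 <- data.frame(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)], stringsAsFactors = FALSE)
a2 <- data.frame(A = c(1:3, 2), B = letters[c(1:3, 2)], stringsAsFactors = FALSE)

# Hypothetical helper: number each occurrence of an (A, B) pair
add_rn <- function(df) {
  df %>%
    group_by(A, B) %>%
    mutate(rn = row_number()) %>%
    ungroup()
}

# Keep the rows of a1 whose (A, B, rn) combination does not occur in a2
result <- anti_join(add_rn(a1), add_rn(a2), by = c("A", "B", "rn")) %>%
  select(-rn)

print(result)
```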

Yes, it does, Steveb; appreciate it. — RBL, Oct 10 '17 at 13:09
Superb! Very compact! Appreciate it! — RBL, Oct 10 '17 at 16:10

d.b · Accepted Answer · 2017-10-10T18:33:31.273

I guess this is similar to DWal's solution, but in base R.

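# one key per row: paste all columns together
a1_temp = Reduce(paste, a1)
# append the occurrence number of each key so that duplicates become distinct
a1_temp = paste(a1_temp, ave(seq_along(a1_temp), a1_temp, FUN = seq_along))

a2_temp = Reduce(paste, a2)
a2_temp = paste(a2_temp, ave(seq_along(a2_temp), a2_temp, FUN = seq_along))

# keep the rows of a1 whose numbered key does not occur in a2
a1[!a1_temp %in% a2_temp,]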
#  A B
#4 4 d
#5 5 e
#7 4 d
#8 2 b
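
The same idea can be wrapped in a small reusable function. This is only a sketch: `row_key` and `remove_matched_rows` are hypothetical names, not part of base R, and the sketch assumes both data.frames share the columns of the first one. It also joins the columns with a separator other than a space, because keys pasted with the default space can collide when values themselves contain spaces (for example "a b" next to "c" versus "a" next to "b c").

```r
# Hypothetical helper: build one key per row, then append the occurrence
# number of that key so that repeated rows get distinct keys.
row_key <- function(df, sep = "\r") {
  key <- do.call(paste, c(unname(as.list(df)), sep = sep))
  occurrence <- ave(seq_along(key), key, FUN = seq_along)
  paste(key, occurrence, sep = sep)
}

# Hypothetical helper: drop from x one row for every matching row in y.
remove_matched_rows <- function(x, y) {
  x[!row_key(x) %in% row_key(y[names(x)]), , drop = FALSE]
}

a1 <- data.frame(A = c(1:5, 2, 4, 2), B = letters[c(1:5, 2, 4, 2)])
a2 <- data.frame(A = c(1:3, 2), B = letters[c(1:3, 2)])

result <- remove_matched_rows(a1, a2)
print(result)
```

As in the answer above, the original row names of `a1` are kept, so it is easy to see which rows survived.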

edited Oct 10 '17 at 18:33

answered Oct 10 '17 at 01:38


score 1 · Answer 4 · answered Oct 10 '17 at 15:14

Here's another solution with dplyr:

library(dplyr)
a1 %>%
  arrange(A) %>%
  group_by(A) %>%
  filter(!(paste0(1:n(), A, B) %in% with(arrange(a2, A), paste0(1:n(), A, B))))

Result:

# A tibble: 4 x 2
# Groups:   A [3]
      A      B
  <dbl> <fctr>
1     2      b
2     4      d
3     4      d
4     5      e

This way of filtering avoids creating extra columns that you would later have to remove from the final output. It also sorts the output by A, which may or may not be what you want.
