count how many times a row appears in a different dataset

Question

I have two datasets df1 and df2

P1 <- c('A', 'A', 'B', NA)
P2 <- c('B', NA, 'B', 'B')
P3 <- c('A', 'B', 'B', 'A')
P4 <- c('A', 'B', NA, 'B')
P5 <- c(NA, NA, NA, 'B')

df1 <- data.frame(P1, P2, P3, P4, P5, row.names = NULL)

[![enter image description here][2]][2]P1 <- c('A', 'A', 'B', 'B', 'A', 'B', 'B', 'A')
P2 <- c('B', 'B', 'B', 'B', 'B', 'B', 'A', 'B')
P3 <- c('A', 'B', 'B', 'A', 'A', 'B', 'B', 'A')
P4 <- c('A', 'B', 'B', 'B', 'A', 'B', 'A', 'B')
P5 <- c('B', 'B','B', 'B', 'B', 'B', 'A', 'B')

df2 <- data.frame(P1, P2, P3, P4, P5, row.names = NULL)

I need to count how many times each row from df1 appears in df2. If a value in df1 is NA, it can be both A and B in the df2. So for example, row #4 from df1 will be counted as rows #4 and #8 in df2.

score 3 · Answer 1 · answered Nov 16 '21 at 04:18

3

You may try

row_appears <- c()
for (i in 1:nrow(df1)){
  x <- df1[i,]
  y <- df1[i,]
  x[is.na(x)] <- "A"
  y[is.na(y)] <- "B"
  z <- sum(apply(df2, 1, function(t) all(x == t)) + apply(df2, 1, function(t) all(y == t)))
  row_appears <- c(row_appears, z)
}
row_appears

[1] 2 1 2 2

answered Nov 16 '21 at 04:18

Park

14,771
6
10
29

it's been a while since I asked this question. But I just noticed this loop applied on real data doesn't give correct results. The reason this algoritm assigns all NA values with all A's or all B's. So, if I have a row like this "A, A, B, NA, NA", this algorithm will convert it into "A, A, B, A, A" and "A, A, B, B, B". However, two last NA's can have different values like this: "A, A, B, A, B" and "A, A, B, B, A". So, those patterns remained uncounted. – Yulia Kentieva Jan 17 '22 at 18:15

Donald Seinen · Accepted Answer · 2022-01-18T05:58:07.683

2

Alternatively, we can treat is as a character match question, converting both data frames into character vectors, and treating NA as possibly A or B.

df1[is.na(df1)] <- "(A|B)" # regex talk for "might be A or B"

x <- do.call(paste, c(df1, sep = ""))
y <- do.call(paste, c(df2, sep = ""))

x |>
  lapply(\(.) stringi::stri_count_regex(y, .)) |>
  lapply(sum) |>
  unlist(use.names = F)

#> [1] 2 1 2 2

Or for R versions older than 4.1.0:

vapply(x, function(o) sum(stringi::stri_count_regex(y, o)), 1, USE.NAMES = F)

edited Jan 18 '22 at 05:58

answered Nov 16 '21 at 04:32

Donald Seinen

4,179
5
15
40

I think this answer can work better than the first one. Is there a way to substitute |> piping as R 4.0.3 doesn't seem to support it? I tried %>% but I guess it works in a different way. – Yulia Kentieva Jan 17 '22 at 18:22
@YuliaKentieva the `|>` pipe and function shorthand were introduced in R 4.1.0, see [this post](https://stackoverflow.com/questions/67633022/what-are-the-differences-between-rs-new-native-pipe-and-the-magrittr-pipe) for differences with magrittr's `%>%`. `|>` is purely syntax shifting - meaning the same can be obtained by rewriting the approach. I have added an approach to the original post using `vapply`, which returns a vector instead of a list. – Donald Seinen Jan 18 '22 at 05:57

Zaw · Answer 3 · 2021-11-16T05:03:18.107

You can also use {tidyverse} or {data.table}.

library(tidyverse)
df3 <- bind_rows(
  df1 |> mutate(across(everything(), replace_na, "A")),
  df1 |> mutate(across(everything(), replace_na, "B"))
)

df2 |> 
  group_by_all() |> 
  summarise(N = n(), .groups = "drop") |> 
  right_join(df3, by = paste0("P", 1:5)) |> 
  mutate(N = replace_na(N, 0))

# # A tibble: 8 x 6
# P1    P2    P3    P4    P5        N
# <chr> <chr> <chr> <chr> <chr> <dbl>
# 1 A     B     A     A     B         2
# 2 A     B     A     B     B         1
# 3 A     B     B     B     B         1
# 4 B     B     A     B     B         1
# 5 B     B     B     B     B         2
# 6 A     B     A     A     A         0
# 7 A     A     B     B     A         0
# 8 B     B     B     A     A         0

library(data.table)
setDT(df1)
setDT(df2)

df1_a <- df1_b <- copy(df1)
df1_a[is.na(df1_a)] <- "A"
df1_b[is.na(df1_b)] <- "B"
df3 <- rbindlist(list(df1_a, df1_b))

df4 <- 
  df2[, .N, by = eval(paste0("P", 1:5))
  ][df3, on = paste0("P", 1:5)]

df4[, N := fifelse(is.na(N), 0, N)][]


#    P1 P2 P3 P4 P5 N
# 1:  A  B  A  A  A 0
# 2:  A  A  B  B  A 0
# 3:  B  B  B  A  A 0
# 4:  A  B  A  B  B 1
# 5:  A  B  A  A  B 2
# 6:  A  B  B  B  B 1
# 7:  B  B  B  B  B 2
# 8:  B  B  A  B  B 1

count how many times a row appears in a different dataset

3 Answers3