0

I have two datasets df1 and df2

P1 <- c('A', 'A', 'B', NA)
P2 <- c('B', NA, 'B', 'B')
P3 <- c('A', 'B', 'B', 'A')
P4 <- c('A', 'B', NA, 'B')
P5 <- c(NA, NA, NA, 'B')

df1 <- data.frame(P1, P2, P3, P4, P5, row.names = NULL)

enter image description here

[![enter image description here][2]][2]P1 <- c('A', 'A', 'B', 'B', 'A', 'B', 'B', 'A')
P2 <- c('B', 'B', 'B', 'B', 'B', 'B', 'A', 'B')
P3 <- c('A', 'B', 'B', 'A', 'A', 'B', 'B', 'A')
P4 <- c('A', 'B', 'B', 'B', 'A', 'B', 'A', 'B')
P5 <- c('B', 'B','B', 'B', 'B', 'B', 'A', 'B')

df2 <- data.frame(P1, P2, P3, P4, P5, row.names = NULL)

enter image description here

I need to count how many times each row from df1 appears in df2. If a value in df1 is NA, it can be both A and B in the df2. So for example, row #4 from df1 will be counted as rows #4 and #8 in df2.

Yulia Kentieva
  • 641
  • 4
  • 13

3 Answers3

3

You may try

row_appears <- c()
for (i in 1:nrow(df1)){
  x <- df1[i,]
  y <- df1[i,]
  x[is.na(x)] <- "A"
  y[is.na(y)] <- "B"
  z <- sum(apply(df2, 1, function(t) all(x == t)) + apply(df2, 1, function(t) all(y == t)))
  row_appears <- c(row_appears, z)
}
row_appears

[1] 2 1 2 2
Park
  • 14,771
  • 6
  • 10
  • 29
  • it's been a while since I asked this question. But I just noticed this loop applied on real data doesn't give correct results. The reason this algoritm assigns all NA values with all A's or all B's. So, if I have a row like this "A, A, B, NA, NA", this algorithm will convert it into "A, A, B, A, A" and "A, A, B, B, B". However, two last NA's can have different values like this: "A, A, B, A, B" and "A, A, B, B, A". So, those patterns remained uncounted. – Yulia Kentieva Jan 17 '22 at 18:15
2

Alternatively, we can treat is as a character match question, converting both data frames into character vectors, and treating NA as possibly A or B.

df1[is.na(df1)] <- "(A|B)" # regex talk for "might be A or B"

x <- do.call(paste, c(df1, sep = ""))
y <- do.call(paste, c(df2, sep = ""))

x |>
  lapply(\(.) stringi::stri_count_regex(y, .)) |>
  lapply(sum) |>
  unlist(use.names = F)

#> [1] 2 1 2 2

Or for R versions older than 4.1.0:

vapply(x, function(o) sum(stringi::stri_count_regex(y, o)), 1, USE.NAMES = F)
Donald Seinen
  • 4,179
  • 5
  • 15
  • 40
  • I think this answer can work better than the first one. Is there a way to substitute |> piping as R 4.0.3 doesn't seem to support it? I tried %>% but I guess it works in a different way. – Yulia Kentieva Jan 17 '22 at 18:22
  • @YuliaKentieva the `|>` pipe and function shorthand were introduced in R 4.1.0, see [this post](https://stackoverflow.com/questions/67633022/what-are-the-differences-between-rs-new-native-pipe-and-the-magrittr-pipe) for differences with magrittr's `%>%`. `|>` is purely syntax shifting - meaning the same can be obtained by rewriting the approach. I have added an approach to the original post using `vapply`, which returns a vector instead of a list. – Donald Seinen Jan 18 '22 at 05:57
2

You can also use {tidyverse} or {data.table}.

library(tidyverse)
df3 <- bind_rows(
  df1 |> mutate(across(everything(), replace_na, "A")),
  df1 |> mutate(across(everything(), replace_na, "B"))
)

df2 |> 
  group_by_all() |> 
  summarise(N = n(), .groups = "drop") |> 
  right_join(df3, by = paste0("P", 1:5)) |> 
  mutate(N = replace_na(N, 0))

# # A tibble: 8 x 6
# P1    P2    P3    P4    P5        N
# <chr> <chr> <chr> <chr> <chr> <dbl>
# 1 A     B     A     A     B         2
# 2 A     B     A     B     B         1
# 3 A     B     B     B     B         1
# 4 B     B     A     B     B         1
# 5 B     B     B     B     B         2
# 6 A     B     A     A     A         0
# 7 A     A     B     B     A         0
# 8 B     B     B     A     A         0

library(data.table)
setDT(df1)
setDT(df2)

df1_a <- df1_b <- copy(df1)
df1_a[is.na(df1_a)] <- "A"
df1_b[is.na(df1_b)] <- "B"
df3 <- rbindlist(list(df1_a, df1_b))

df4 <- 
  df2[, .N, by = eval(paste0("P", 1:5))
  ][df3, on = paste0("P", 1:5)]

df4[, N := fifelse(is.na(N), 0, N)][]


#    P1 P2 P3 P4 P5 N
# 1:  A  B  A  A  A 0
# 2:  A  A  B  B  A 0
# 3:  B  B  B  A  A 0
# 4:  A  B  A  B  B 1
# 5:  A  B  A  A  B 2
# 6:  A  B  B  B  B 1
# 7:  B  B  B  B  B 2
# 8:  B  B  A  B  B 1
Zaw
  • 1,434
  • 7
  • 15