Count number of times the content of two columns are equal and different in dataframe in R

Question

I have this dataframe

df <- structure(list(`Prediction (Ge)` = c("Paranthropus", "Paranthropus", 
"Homo", "Paranthropus", "Australopithecus", "Paranthropus", "Paranthropus", 
"Australopithecus", "Paranthropus", "Australopithecus", "Paranthropus", 
"Australopithecus", "Australopithecus", "Australopithecus", "Australopithecus", 
"Paranthropus", "Homo", "Australopithecus", "Paranthropus", "Paranthropus", 
"Paranthropus", "Paranthropus", "Australopithecus", "Paranthropus", 
"Australopithecus", "Paranthropus", "Australopithecus"), `Prediction (Sp)` = c("Australopithecus africanus", 
"Paranthropus robustus", "Paranthropus boisei", "Paranthropus robustus", 
"Paranthropus robustus", "Paranthropus robustus", "Paranthropus robustus", 
"Australopithecus afarensis", "Paranthropus boisei", "Paranthropus robustus", 
"Paranthropus robustus", "Paranthropus robustus", "Australopithecus afarensis", 
"Australopithecus afarensis", "Australopithecus afarensis", "Paranthropus robustus", 
"Homo habilis", "Australopithecus afarensis", "Paranthropus robustus", 
"Paranthropus boisei", "Paranthropus boisei", "Paranthropus robustus", 
"Australopithecus afarensis", "Paranthropus robustus", "Australopithecus afarensis", 
"Paranthropus robustus", "Australopithecus afarensis")), row.names = c(2L, 
3L, 6L, 7L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 19L, 20L, 26L, 
27L, 28L, 29L, 30L, 31L, 32L, 34L, 35L, 37L, 38L, 42L, 46L, 47L
), class = "data.frame", na.action = structure(c(`1` = 1L, `4` = 4L, 
`5` = 5L, `8` = 8L, `16` = 16L, `17` = 17L, `18` = 18L, `21` = 21L, 
`22` = 22L, `23` = 23L, `24` = 24L, `25` = 25L, `33` = 33L, `36` = 36L, 
`39` = 39L, `40` = 40L, `41` = 41L, `43` = 43L, `44` = 44L, `45` = 45L
), class = "omit"))

Calling head(df) shows what it looks like:

head(df)
    Prediction (Ge)            Prediction (Sp)
2      Paranthropus Australopithecus africanus
3      Paranthropus      Paranthropus robustus
6              Homo        Paranthropus boisei
7      Paranthropus      Paranthropus robustus
9  Australopithecus      Paranthropus robustus
10     Paranthropus      Paranthropus robustus

There are two columns, which come from two different predictions.

What I would like to know is whether the genus in the second column, Prediction (Sp), is the same as the genus in Prediction (Ge). This means comparing the first word of Prediction (Sp) with the value in Prediction (Ge).

Looking only at the six rows shown by head(df), 3 rows match (rows 3, 7 and 10) and 3 rows differ (rows 2, 6 and 9).

How can I get the total number of matching and non-matching rows with a single line of code?

score 1 · Answer 1 · answered Feb 01 '23 at 14:31

How about this:

library(dplyr)
library(stringr)

df %>% 
  mutate(right_genus = str_detect(`Prediction (Sp)`, `Prediction (Ge)`)) 
#>     Prediction (Ge)            Prediction (Sp) right_genus
#> 2      Paranthropus Australopithecus africanus       FALSE
#> 3      Paranthropus      Paranthropus robustus        TRUE
#> 6              Homo        Paranthropus boisei       FALSE
#> 7      Paranthropus      Paranthropus robustus        TRUE
#> 9  Australopithecus      Paranthropus robustus       FALSE
#> 10     Paranthropus      Paranthropus robustus        TRUE
#> 11     Paranthropus      Paranthropus robustus        TRUE
#> 12 Australopithecus Australopithecus afarensis        TRUE
#> 13     Paranthropus        Paranthropus boisei        TRUE
#> 14 Australopithecus      Paranthropus robustus       FALSE
#> 15     Paranthropus      Paranthropus robustus        TRUE
#> 19 Australopithecus      Paranthropus robustus       FALSE
#> 20 Australopithecus Australopithecus afarensis        TRUE
#> 26 Australopithecus Australopithecus afarensis        TRUE
#> 27 Australopithecus Australopithecus afarensis        TRUE
#> 28     Paranthropus      Paranthropus robustus        TRUE
#> 29             Homo               Homo habilis        TRUE
#> 30 Australopithecus Australopithecus afarensis        TRUE
#> 31     Paranthropus      Paranthropus robustus        TRUE
#> 32     Paranthropus        Paranthropus boisei        TRUE
#> 34     Paranthropus        Paranthropus boisei        TRUE
#> 35     Paranthropus      Paranthropus robustus        TRUE
#> 37 Australopithecus Australopithecus afarensis        TRUE
#> 38     Paranthropus      Paranthropus robustus        TRUE
#> 42 Australopithecus Australopithecus afarensis        TRUE
#> 46     Paranthropus      Paranthropus robustus        TRUE
#> 47 Australopithecus Australopithecus afarensis        TRUE

df %>% 
  mutate(right_genus = str_detect(`Prediction (Sp)`, `Prediction (Ge)`)) %>% 
  group_by(right_genus) %>% 
  tally()
#> # A tibble: 2 × 2
#>   right_genus     n
#>   <lgl>       <int>
#> 1 FALSE           5
#> 2 TRUE           22

Created on 2023-02-01 by the reprex package (v2.0.1)
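One thing to keep in mind is that str_detect() treats its second argument as a regular expression and matches it anywhere in the string, not only at the start. If an exact comparison of the first word is preferred, a possible variant is sketched below. It assumes the genus is always the first space-separated word, and `small` is a hypothetical name for a data frame rebuilt from the six rows shown by head(df) so that the sketch runs on its own.

```r
library(dplyr)
library(stringr)

# First six rows of the question's data (hypothetical name `small`)
small <- data.frame(
  `Prediction (Ge)` = c("Paranthropus", "Paranthropus", "Homo",
                        "Paranthropus", "Australopithecus", "Paranthropus"),
  `Prediction (Sp)` = c("Australopithecus africanus", "Paranthropus robustus",
                        "Paranthropus boisei", "Paranthropus robustus",
                        "Paranthropus robustus", "Paranthropus robustus"),
  check.names = FALSE  # keep the column names with spaces and parentheses
)

# word(x, 1) extracts the first space-separated word, here the genus
counts <- small %>%
  mutate(right_genus = word(`Prediction (Sp)`, 1) == `Prediction (Ge)`) %>%
  count(right_genus)

counts  # 3 FALSE and 3 TRUE for these six rows
```

On the full df from the question the same pipeline should work unchanged, with df in place of small.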

score 1 · Accepted Answer · answered Feb 01 '23 at 14:32

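Use grepl, applied to each row separately with mapply. No packages are needed. The first call keeps only the matching rows; the second counts matches and mismatches.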

subset(df, mapply(grepl, `Prediction (Ge)`, `Prediction (Sp)`))
##     Prediction (Ge)            Prediction (Sp)
## 3      Paranthropus      Paranthropus robustus
## 7      Paranthropus      Paranthropus robustus
## 10     Paranthropus      Paranthropus robustus
## ...snip...

table(with(df, mapply(grepl, `Prediction (Ge)`, `Prediction (Sp)`)))
##
## FALSE  TRUE 
##     5    22
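Since grepl matches the genus anywhere in the species string, a stricter base R variant may be worth considering. The sketch below uses startsWith(), which is vectorized over both arguments and does no regex matching. It assumes every species entry has the form "Genus species" with a single space after the genus; `small` is a hypothetical name for the six rows shown by head(df).

```r
# First six rows of the question's data (hypothetical name `small`)
small <- data.frame(
  `Prediction (Ge)` = c("Paranthropus", "Paranthropus", "Homo",
                        "Paranthropus", "Australopithecus", "Paranthropus"),
  `Prediction (Sp)` = c("Australopithecus africanus", "Paranthropus robustus",
                        "Paranthropus boisei", "Paranthropus robustus",
                        "Paranthropus robustus", "Paranthropus robustus"),
  check.names = FALSE  # keep the column names with spaces and parentheses
)

# Append a space so that the genus must be a complete first word
genus_prefix <- paste0(small$`Prediction (Ge)`, " ")

# TRUE where the species string begins with "<genus> "
same_genus <- startsWith(small$`Prediction (Sp)`, genus_prefix)

table(same_genus)  # 3 FALSE and 3 TRUE for these six rows
```

Note that a species entry holding only a genus name, with no trailing space, would count as a mismatch under this assumption.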

score 1 · Answer 3 · answered Feb 01 '23 at 14:38

You can use gsub() to strip everything after the genus in Prediction (Sp), and table() to count the matches.

> df$a <- df$`Prediction (Ge)`
> df$b <- gsub(' .+$', '', df$`Prediction (Sp)`)
> table(df$a == df$b)

FALSE  TRUE 
    5    22

You can also store the comparison as a column if you like.

> df$match <- df$a == df$b
> head(df)
    Prediction (Ge)            Prediction (Sp)                a
2      Paranthropus Australopithecus africanus     Paranthropus
3      Paranthropus      Paranthropus robustus     Paranthropus
6              Homo        Paranthropus boisei             Homo
7      Paranthropus      Paranthropus robustus     Paranthropus
9  Australopithecus      Paranthropus robustus Australopithecus
10     Paranthropus      Paranthropus robustus     Paranthropus
                  b match
2  Australopithecus FALSE
3      Paranthropus  TRUE
6      Paranthropus FALSE
7      Paranthropus  TRUE
9      Paranthropus FALSE
10     Paranthropus  TRUE
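If the helper columns a and b are not wanted, the same idea can be sketched without modifying the data frame. Wrapping the comparison in factor() with both levels is an optional extra so that table() reports a zero count when every row matches or every row differs. Here `small` is a hypothetical name for the six rows shown by head(df), and the genus is assumed to be everything before the first space.

```r
# First six rows of the question's data (hypothetical name `small`)
small <- data.frame(
  `Prediction (Ge)` = c("Paranthropus", "Paranthropus", "Homo",
                        "Paranthropus", "Australopithecus", "Paranthropus"),
  `Prediction (Sp)` = c("Australopithecus africanus", "Paranthropus robustus",
                        "Paranthropus boisei", "Paranthropus robustus",
                        "Paranthropus robustus", "Paranthropus robustus"),
  check.names = FALSE  # keep the column names with spaces and parentheses
)

# Drop everything from the first space onward, leaving the genus
first_word <- sub(" .*$", "", small$`Prediction (Sp)`)
matches <- first_word == small$`Prediction (Ge)`

# Fixing both levels keeps a zero count visible in the table
result <- table(factor(matches, levels = c(FALSE, TRUE)))
result  # 3 FALSE and 3 TRUE for these six rows
```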
