How to find exact matches between rows in two columns?

Question

My dataset has two columns and about 33,000 rows. Column 1 is called "Surname" and column 2 is called "nickname".

I need to find out how many people's surname is exactly the same as their nickname. Is there a function for this in R?

Please add data using `dput` and show the expected output for the same. Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269). — Ronak Shah, Aug 31 '20 at 14:53

score 1 · Answer 1 · answered Aug 27 '20 at 20:12

In your case, you can simply create a logical test of equality between the two columns. If you then sum the logical values that result from this test, you get the number of `TRUE`s, that is, the number of rows that have the same surname and nickname.

tab <- data.frame(
  nickname = sample(c("Ana", "Tese", "Maker"), size = 20, replace = TRUE),
  surname = sample(c("Ana", "Ed", "Philip"), size = 20, replace = TRUE)
)

tab$test <- tab$nickname == tab$surname

sum(tab$test)
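
One caveat that may matter on a real dataset: if either column contains missing values, `==` returns `NA` for those rows, and `sum()` then returns `NA` as well. The following is a minimal sketch, assuming missing entries should simply not count as matches. The `tab_na` data frame and its values are made up for illustration.

```r
# Hypothetical data with one missing nickname
tab_na <- data.frame(
  surname  = c("Ana", "Ed", "Philip", "Ana"),
  nickname = c("Ana", NA, "Philip", "Tese"),
  stringsAsFactors = FALSE
)

# Element-wise comparison; the row with the NA compares to NA, not FALSE
is_match <- tab_na$surname == tab_na$nickname

# na.rm = TRUE drops the NA comparisons before counting the TRUEs
n_matches <- sum(is_match, na.rm = TRUE)
n_matches
```

Without `na.rm = TRUE`, the sum for this example would be `NA` rather than a count.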

score 0 · Answer 2 · edited Aug 27 '20 at 20:23

Fàîžà!

My solution involves creating a new column in your dataframe which indicates TRUE if the surname and nickname are exactly the same and FALSE if they are not exactly the same.

To do this, you need the dplyr package. First, create an example data frame:

surname <- c("Smith", "Potter", "Smith")
nickname <- c("Bobby", "Potter", "Smith")
# Build the data frame from the two vectors defined above
df <- data.frame(surname = surname, nickname = nickname)

Now that we have the dataframe, let's add the dplyr code:

library(dplyr)
df <- df %>% 
  mutate(equal_names = case_when(
    surname == nickname ~ TRUE, 
    surname != nickname ~ FALSE))

The result is:

> df
  surname nickname equal_names
1   Smith    Bobby       FALSE
2  Potter   Potter        TRUE
3   Smith    Smith        TRUE

case_when() returns the value on the right-hand side of `~` for the first condition that evaluates to TRUE.

If you want more advanced screening, you'd need to check how regular expressions work. This post has a few hints about this.

answered Aug 27 '20 at 20:14 by Gabriel Reis · edited Aug 27 '20 at 20:23 by OTStats

`if_else()` would be an alternative to `case_when()` :) – OTStats Aug 27 '20 at 20:19
my dataset is huge like around 33000 rows :( – Fàîžà Tabàssùm Aug 27 '20 at 20:54
In that case, both `case_when` and `if_else` are unnecessary, since `==` already gives a logical vector that can be added directly as a column. – Alexlok Aug 28 '20 at 13:49
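
The point made in the comment above can be sketched in base R as follows. This is only an illustration that rebuilds the small `surname`/`nickname` example from this answer. Since `==` is vectorised, the same two lines should work unchanged on a data frame with 33,000 rows.

```r
# Small example data, same values as in the answer
df <- data.frame(
  surname  = c("Smith", "Potter", "Smith"),
  nickname = c("Bobby", "Potter", "Smith"),
  stringsAsFactors = FALSE
)

# `==` already returns one TRUE/FALSE per row, so it can be stored directly
df$equal_names <- df$surname == df$nickname

# Count the rows where surname and nickname are identical
n_equal <- sum(df$equal_names)
n_equal
```

Inside a dplyr pipeline, the equivalent would be `mutate(equal_names = surname == nickname)`.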

score 0 · Answer 3 · answered Aug 27 '20 at 20:24

A simple base R approach like the one below might work:

sum(do.call("==",df))

Example

df <- structure(list(surname = c("A", "C", "A", "B", "A", "C", "C", 
"B", "B", "C"), nickname = c("C", "A", "A", "A", "B", "B", "B", 
"B", "C", "A")), class = "data.frame", row.names = c(NA, -10L
))

> df
   surname nickname
1        A        C
2        C        A
3        A        A
4        B        A
5        A        B
6        C        B
7        C        B
8        B        B
9        B        C
10       C        A

> sum(do.call("==",df))
[1] 2
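
To see why this works, recall that a data frame is a list of columns, so `do.call("==", df)` passes the two columns as the two arguments of `==`. The sketch below compares that call with the explicit column-by-column form on a small made-up data frame. It assumes the data frame has exactly two columns and that both are character vectors, not factors.

```r
# Small made-up example with exactly two character columns
df <- data.frame(
  surname  = c("A", "C", "A", "B"),
  nickname = c("C", "A", "A", "B"),
  stringsAsFactors = FALSE
)

# do.call() hands the list of columns to `==` as its two arguments
via_do_call <- do.call("==", df)

# The same comparison written out explicitly
via_columns <- df$surname == df$nickname

# Both forms give the same logical vector
same_result <- identical(via_do_call, via_columns)
same_result

# Number of matching rows
sum(via_columns)
```

If the data frame has additional columns, select the two of interest first, for example `df[c("surname", "nickname")]`, because `==` accepts only two arguments.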
