How to compare the number of statements that match between two string vectors

Question

I want to compare two string vectors as follows:

Test1<-c("Everything is normal","It is all sunny","Its raining cats and dogs","Mild")

Test2<-c("Everything is normal","It is thundering","Its raining cats and dogs","Cloudy")

Filtered<-data.frame(Test1,Test2)

Intended output:

Number the same: 2
Number present in Test1 and not in Test2: 2
Number present in Test2 and not in Test1: 2

I would also like to see which strings differ, so the other intended output would be as follows (ideally as part of the original data frame):

Same<-c("Everything is normal","Its raining cats and dogs")
OnlyInA<-c("It is all sunny","Mild")
OnlyInB<-c("It is thundering","Cloudy")

I have tried:

Filtered$Same<-intersect(Filtered$Test1,Filtered$Test2)
Filtered$InAButNotB<-setdiff(Filtered$Test1,Filtered$Test2)

But when I run the last line on a longer dataset, I get the error "replacement has 127 rows, data has 400".

I suppose this is because setdiff only returns the values that differ, so its result is shorter than the data frame and the columns don't line up. How do I fill the rows where there is no difference with NA so that I can keep the result in the original data frame?

what package is the function Filtered in? I don't see it in base R. — Richard Lusch, Sep 29 '17 at 11:20
In your Filtered data frame, do you have missing values set as NA for the vectors of unequal length? — Richard Lusch, Sep 29 '17 at 11:26
Actually no @RichardLusch, I didn't know I could do that. Could you show me how? — Sebastian Zeki, Sep 29 '17 at 11:33
When I combine Test1 and Test2 as they stand in your code, I get the error "Error in data.frame(Test1, Test2) : arguments imply differing number of rows: 3, 4". I am wondering what you have done to overcome that in your actual data set. — Richard Lusch, Sep 29 '17 at 11:36
I will correct it as the original data set has equal numbers — Sebastian Zeki, Sep 29 '17 at 11:43
Once you correct it, your last line of code for Filtered$InAButNotB works. — Richard Lusch, Sep 29 '17 at 11:48

score 1 · Answer 1 · answered Sep 29 '17 at 11:41

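The base R outer function applies a function to every pairwise combination of elements from two vectors. Using outer with '==' therefore compares each element of Test1 with each element of Test2 and returns a logical matrix with one row per element of Test1 and one column per element of Test2: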

Test1<-c("Everything is normal","It is all sunny","Its raining cats and dogs")
Test2<-c("Everything is normal","It is thundering","Its raining cats and dogs","Cloudy")

# test each element in Test1 for equality with each element in Test2
compare <- outer(Test1, Test2, '==') 

# calculate overlaps and uniques
overlaps <- sum(compare) # number of overlaps: 2
unique.test1 <- (rowSums(compare) == 0) # in Test1 but not Test2
unique.test2 <- (colSums(compare) == 0) # in Test2 but not Test1

# return uniques
OnlyInA <- Test1[unique.test1]
OnlyInB <- Test2[unique.test2]
same <- Test1[rowSums(compare) > 0] # in both; > 0 also covers strings repeated in Test2

# counts
n.unique.a <- sum(unique.test1)
n.unique.b <- sum(unique.test2)
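
If either vector can contain repeated strings, a row or column sum of the comparison matrix can exceed 1, so a test of the form rowSums(compare) > 0 is more robust than == 1. The following is only a small sketch of that behaviour; the vectors a and b are made up for illustration and are not from the question:

```r
a <- c("x", "y", "z")
b <- c("x", "x", "w")            # "x" appears twice in b

# logical matrix: rows follow a, columns follow b
compare <- outer(a, b, "==")

# number of matches in b for each element of a
row.hits <- rowSums(compare)

# a strict "== 1" test misses "x" because it matches twice
matched.strict <- a[row.hits == 1]

# "> 0" keeps every element of a that has at least one match
matched.any <- a[row.hits > 0]
```

Note that outer builds a length(a) by length(b) matrix, so for long vectors the %in% approach is likely to be lighter on memory.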

Alternatively, the %in% operator is useful for this sort of thing as well:

Test1[Test1 %in% Test2]
[1] "Everything is normal"      "Its raining cats and dogs"

Test1[!(Test1 %in% Test2)]
[1] "It is all sunny"

Test2[!(Test2 %in% Test1)]
[1] "It is thundering" "Cloudy"
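
Because %in% returns one logical value per element, its result has the same length as the column it was computed from and can be stored back in the data frame without a length error. A minimal sketch, assuming the four-element Test1 from the question; the column names InTest2, OnlyInTest1 and OnlyInTest2 are hypothetical:

```r
Test1 <- c("Everything is normal", "It is all sunny", "Its raining cats and dogs", "Mild")
Test2 <- c("Everything is normal", "It is thundering", "Its raining cats and dogs", "Cloudy")
Filtered <- data.frame(Test1, Test2, stringsAsFactors = FALSE)

# one logical per row: does this row's Test1 value occur anywhere in Test2?
Filtered$InTest2 <- Filtered$Test1 %in% Filtered$Test2

# keep the string where it is unique to its column, NA in every other row
Filtered$OnlyInTest1 <- ifelse(Filtered$Test1 %in% Filtered$Test2,
                               NA_character_, Filtered$Test1)
Filtered$OnlyInTest2 <- ifelse(Filtered$Test2 %in% Filtered$Test1,
                               NA_character_, Filtered$Test2)

# counts, derived from the row-aligned columns
n.same        <- sum(Filtered$InTest2)
n.only.test1  <- sum(!is.na(Filtered$OnlyInTest1))
n.only.test2  <- sum(!is.na(Filtered$OnlyInTest2))
```

These counts are per row, so a string that is repeated within a column is counted once for each occurrence.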

how can I use %in% to compare two columns of lists within a dataframe? — Sebastian Zeki, Sep 29 '17 at 11:47

score 0 · Answer 2 · answered Sep 29 '17 at 11:58

Using tidyverse functions, you can try something like:

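library(dplyr) # provides %>% and summarise()

Filtered %>%
  summarise(comm = sum(Test1 %in% Test2),    # rows of Test1 also found in Test2
            InA = sum(!(Test1 %in% Test2)),  # rows only in Test1
            InB = sum(!(Test2 %in% Test1)))  # rows only in Test2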

Alternatively, if you are working with plain vectors and are only interested in the aggregated counts, you can use the set functions:

length(intersect(Test1,Test2)) # number in both
length(setdiff(Test1,Test2))   # number only in Test1
length(setdiff(Test2,Test1))   # number only in Test2
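
intersect and setdiff return only the unique matching or differing values, so their results are usually shorter than the data frame; assigning such a result to a column is what produces the "replacement has ... rows" error from the question. One possible workaround, sketched here with a hypothetical helper pad_to() and hypothetical column names, is to pad each result with NA up to the number of rows:

```r
Test1 <- c("Everything is normal", "It is all sunny", "Its raining cats and dogs", "Mild")
Test2 <- c("Everything is normal", "It is thundering", "Its raining cats and dogs", "Cloudy")
Filtered <- data.frame(Test1, Test2, stringsAsFactors = FALSE)

# hypothetical helper: extend x with NA until it has n elements
pad_to <- function(x, n) {
  stopifnot(length(x) <= n)
  length(x) <- n   # lengthening a vector in base R fills the new slots with NA
  x
}

n <- nrow(Filtered)
Filtered$Same        <- pad_to(intersect(Filtered$Test1, Filtered$Test2), n)
Filtered$OnlyInTest1 <- pad_to(setdiff(Filtered$Test1, Filtered$Test2), n)
Filtered$OnlyInTest2 <- pad_to(setdiff(Filtered$Test2, Filtered$Test1), n)

# aggregated counts in both directions
counts <- c(same       = length(intersect(Test1, Test2)),
            only.test1 = length(setdiff(Test1, Test2)),
            only.test2 = length(setdiff(Test2, Test1)))
```

The padded columns are just a place to store the values: they are packed at the top and no longer line up with the rows they came from.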


Aramis7d


But how do I get the results back into the original dataframe without getting the error? – Sebastian Zeki Sep 29 '17 at 12:02
What error? As you mention in the comments, if both columns have an equal number of rows, it doesn't give you an error. If you're getting any other error message, please update the question accordingly. – Aramis7d Sep 29 '17 at 12:05
So the question is how to process the data if the columns do not have an equal number of rows? – Sebastian Zeki Sep 29 '17 at 12:32
If the original vectors are not equal in length, you cannot run `Filtered<-data.frame(Test1,Test2)`. Please edit the question and provide a better example. You can refer to [this example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for guidance. – Aramis7d Sep 30 '17 at 13:01
