I want to compare two string vectors as follows:

Test1<-c("Everything is normal","It is all sunny","Its raining cats and dogs","Mild")

Test2<-c("Everything is normal","It is thundering","Its raining cats and dogs","Cloudy")

Filtered<-data.frame(Test1,Test2)

Intended output:

Number the same: 2
Number present in Test1 and not in Test2: 2
Number present in Test2 and not in Test1: 2

I would also like to see which strings differ, so the other intended output should be as follows (and should also become part of the original dataframe):

Same<-c("Everything is normal","Its raining cats and dogs")
OnlyInA<-c("It is all sunny","Mild")
OnlyInB<-c("It is thundering","Cloudy")

I have tried:

Filtered$Same<-intersect(Filtered$Test1,Filtered$Test2)
Filtered$InAButNotB<-setdiff(Filtered$Test1,Filtered$Test2)

but when I run the last line on a longer dataset I get the error "replacement has 127 rows, data has 400".

I suppose this is because setdiff only returns the strings that differ, so its result is shorter than the dataframe's columns. How do I put NA in the rows where there is no difference, so that the result can be stored in the original dataframe?

Sebastian Zeki

2 Answers


The base R outer function applies a function to every pairwise combination of the elements of two vectors. Using outer with '==' therefore compares each element of the first vector with each element of the second:

Test1<-c("Everything is normal","It is all sunny","Its raining cats and dogs")
Test2<-c("Everything is normal","It is thundering","Its raining cats and dogs","Cloudy")

# test each element in Test1 for equality with each element in Test2
compare <- outer(Test1, Test2, '==') 

# calculate overlaps and uniques
overlaps <- sum(compare) # number of overlaps: 2
unique.test1 <- (rowSums(compare) == 0) # in Test1 but not Test2
unique.test2 <- (colSums(compare) == 0) # in Test2 but not Test1

# return uniques
OnlyInA <- Test1[unique.test1]
OnlyInB <- Test2[unique.test2]
same <- Test1[rowSums(compare) > 0] # in both Test1 and Test2

# counts
n.unique.a <- sum(unique.test1)
n.unique.b <- sum(unique.test2)
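
One caveat worth illustrating: rowSums(compare) counts how many times each element of the first vector occurs in the second, so it can exceed 1 when strings repeat. The sketch below uses two made-up vectors (x and y are hypothetical, not from the question) to show why testing for "> 0" is the safer membership check.

```r
# Hypothetical vectors; "b" occurs twice in y
x <- c("a", "b", "c")
y <- c("b", "b", "d")

# 3 x 3 logical matrix: compare[i, j] is TRUE when x[i] == y[j]
compare <- outer(x, y, "==")

# Number of times each element of x occurs in y
match_counts <- rowSums(compare)

in_both   <- x[match_counts > 0]      # elements of x found anywhere in y
only_in_x <- x[match_counts == 0]     # elements of x never found in y
only_in_y <- y[colSums(compare) == 0] # elements of y never found in x

# Selecting on an exact count of 1 would miss "b", which matches twice
exactly_once <- x[match_counts == 1]
```

Here in_both contains "b", while exactly_once is empty, because "b" matches two positions in y.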

Alternatively, the %in% operator is useful for this sort of thing as well:

Test1[Test1 %in% Test2]
[1] "Everything is normal"      "Its raining cats and dogs"

Test1[!(Test1 %in% Test2)]
[1] "It is all sunny"

Test2[!(Test2 %in% Test1)]
[1] "It is thundering" "Cloudy"    
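
To keep the results in the original data frame, one possible approach (a sketch, not necessarily the only way) is to build columns of the same length as the data frame and fill the non-matching rows with NA. The column names Same, OnlyInTest1 and OnlyInTest2 are made up for illustration, and stringsAsFactors = FALSE is assumed so that the columns are plain character vectors.

```r
Test1 <- c("Everything is normal", "It is all sunny", "Its raining cats and dogs", "Mild")
Test2 <- c("Everything is normal", "It is thundering", "Its raining cats and dogs", "Cloudy")

# Assumption: keep strings as character, not factor (the default before R 4.0)
Filtered <- data.frame(Test1, Test2, stringsAsFactors = FALSE)

# TRUE for rows whose Test1 value occurs anywhere in Test2
in_both <- Filtered$Test1 %in% Filtered$Test2

# Each new column has one entry per row, so the assignment always fits
Filtered$Same        <- ifelse(in_both, Filtered$Test1, NA_character_)
Filtered$OnlyInTest1 <- ifelse(in_both, NA_character_, Filtered$Test1)
Filtered$OnlyInTest2 <- ifelse(Filtered$Test2 %in% Filtered$Test1,
                               NA_character_, Filtered$Test2)

# Counts are the number of non-NA entries in each new column
counts <- colSums(!is.na(Filtered[c("Same", "OnlyInTest1", "OnlyInTest2")]))
```

Because every new column has exactly nrow(Filtered) entries, the "replacement has ... rows" error cannot occur, whatever the number of matches.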
jdobres

Using tidyverse functions, you can try something like:

library(dplyr)

Filtered %>%
  summarise(comm = sum(Test1 %in% Test2),     # present in both columns
            InA = sum(!(Test1 %in% Test2)),   # only in Test1
            InB = sum(!(Test2 %in% Test1)))   # only in Test2

Alternatively, if you are working with plain vectors and are only interested in the aggregated counts, you can use the following as well:

length(intersect(Test1,Test2))
length(setdiff(Test1,Test2))
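
Note that the set functions and %in% can give different counts when strings repeat: intersect and setdiff return distinct values, whereas sum(... %in% ...) counts every element. A small sketch with made-up vectors (a and b are hypothetical) shows the difference, along with the third count obtained by swapping the arguments of setdiff.

```r
# Hypothetical vectors; "x" occurs twice in a
a <- c("x", "x", "y", "z")
b <- c("x", "w")

n_shared_distinct <- length(intersect(a, b)) # distinct strings in both
n_shared_elements <- sum(a %in% b)           # elements of a found in b
n_only_a <- length(setdiff(a, b))            # distinct strings only in a
n_only_b <- length(setdiff(b, a))            # distinct strings only in b
```

With these vectors the distinct count of shared strings is 1, while the element count is 2, so which one to use depends on whether repeated strings should be counted once or every time.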
Aramis7d
  • But how do I get the results back into the original dataframe without getting the error? – Sebastian Zeki Sep 29 '17 at 12:02
  • What error? As you mention in the comments, if both columns have an equal number of rows, it doesn't give you an error. If you're getting any other error message, please update the question accordingly. – Aramis7d Sep 29 '17 at 12:05
  • So the question is how to process the data if the columns do not have an equal number of rows? – Sebastian Zeki Sep 29 '17 at 12:32
  • If the original vectors are not equal in length, you cannot run `Filtered<-data.frame(Test1,Test2)`. Please edit the question and provide a better example. You can refer to [this example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for guidance. – Aramis7d Sep 30 '17 at 13:01