searching text in data frames in R

Question

I have two different data frames in R.

Data set A has the following data:

cat
dog
Rat
Parrot
Tiger

Data set B has the following data:

Give milk to cat
dog bites
life span of dog is 10 years
Cow gives us milk
Tiger have huge Jaws

Now the R code has to check all of the data in B for each and every value in data set A.

Try `grepl(paste(A$col, collapse="|"), B$col, ignore.case = TRUE)` – akrun, May 24 '18 at 05:26
What is your expected output? Based on the `grepl`, I get the first 3 as TRUE and the others FALSE. If you need it to be more accurate, use `grepl(paste0("\\b(", paste(A$col, collapse="|"), ")\\b"), B$col, ignore.case = TRUE)` – akrun, May 24 '18 at 05:38
For me it shows 3 statements as FALSE. Also, assuming I have 100 rows in data set A and 450 in data set B, will that work? – Murali, May 24 '18 at 05:39
I copied your example, and it shows me the correct output. Try the updated code above. – akrun, May 24 '18 at 05:40
I have changed the last statement now; please try it with your code. – Murali, May 24 '18 at 05:54

MKR · Accepted Answer · 2018-05-30T11:30:57.233


One option is to use `apply` to find every word in `df_A` that is present in `df_B`. The OP did not clearly specify the expected format. The words from `df_A` that are found can be listed by calling `unlist` and `unique` on the final output.

library(dplyr)
apply(df_B,1, function(x){
  df_A$Word[(df_A$Word %in% strsplit(x, split=" ")[[1]])]
}) %>% unlist() %>% unique()
#[1] "cat"   "dog"   "Tiger"

#If objective is to find which row in B contains at least a word from df_A then:
df_B$Have_A <- mapply(function(x){
  any(df_A$Word %in% strsplit(x, split=" ")[[1]])
}, df_B$Text)

df_B
#                           Text Have_A
# 1             Give milk to cat   TRUE
# 2                    dog bites   TRUE
# 3 life span of dog is 10 years   TRUE
# 4            Cow gives us milk  FALSE
# 5         Tiger have huge Jaws   TRUE

Data:

df_B <- read.table(text =
"Text 
'Give milk to cat'
'dog bites'
'life span of dog is 10 years'
'Cow gives us milk'
'Tiger have huge Jaws'",
header = TRUE, stringsAsFactors = FALSE)



df_A <- read.table(text =
"Word 
cat
dog
Rat
Parrot
Tiger",
header = TRUE, stringsAsFactors = FALSE)
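Note that `%in%` compares whole tokens exactly, so the approach above is case-sensitive and would miss `rat` for `Rat`, or `cat,` with trailing punctuation. A minimal sketch of a case-insensitive variant follows, assuming the same `df_A` and `df_B` as above (rebuilt here so the block runs on its own). The helper name `has_word` and the extra test sentence are hypothetical, not part of the original answer.

```r
# Same data as in the answer, rebuilt so the sketch is self-contained.
df_A <- data.frame(Word = c("cat", "dog", "Rat", "Parrot", "Tiger"),
                   stringsAsFactors = FALSE)
df_B <- data.frame(Text = c("Give milk to cat", "dog bites",
                            "life span of dog is 10 years",
                            "Cow gives us milk", "Tiger have huge Jaws"),
                   stringsAsFactors = FALSE)

# Hypothetical helper: TRUE if any element of `words` occurs as a whole
# token in `text`, ignoring case and punctuation.
has_word <- function(text, words) {
  # Lower-case the text and split on runs of non-alphanumeric characters
  tokens <- strsplit(tolower(text), "[^[:alnum:]]+")[[1]]
  any(tolower(words) %in% tokens)
}

# vapply guarantees one logical value per row of df_B
df_B$Have_A <- vapply(df_B$Text, has_word, logical(1),
                      words = df_A$Word, USE.NAMES = FALSE)
df_B$Have_A
#[1]  TRUE  TRUE  TRUE FALSE  TRUE

# Punctuation and case no longer prevent a match
has_word("A rat, a parrot.", df_A$Word)
#[1] TRUE
```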

edited May 30 '18 at 11:30

answered May 24 '18 at 05:43


Thank you MKR for your answer. Suppose I want to add a value, such as yes or no, depending on whether df_B has a word from df_A. I have also changed data set B; can you please have a look? – Murali May 24 '18 at 05:53
@Murali Yes, you can even do that in my 2nd option. Let me update the answer. – MKR May 24 '18 at 05:57
Thank you MKR for your answer. I see this error when I try to execute the code: `Error in apply(df_B, 1, function(x) { : dim(X) must have a positive length`. Can you please tell me which version of R you are using? Let me know if there is an issue. – Murali May 24 '18 at 07:59
Hi MKR, please have a look at the error. – Murali May 30 '18 at 11:08
@Murali My R version is `3.4.2 (2017-09-28)`. I think you have a vector defined as `df_B`, which can cause the error you mentioned. I have changed my answer to use `mapply`; you can try that. You should not get the error with `mapply`. – MKR May 30 '18 at 11:32
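The error discussed in the comments can be reproduced with a plain character vector: `apply` needs an object with dimensions (a matrix or data frame), while functions such as `mapply`, `sapply` and `vapply` iterate over the elements of a vector. A minimal sketch, where `txt` and `words` are hypothetical names standing in for the OP's objects:

```r
words <- c("cat", "dog", "Rat", "Parrot", "Tiger")
txt <- c("Give milk to cat", "Cow gives us milk")  # a vector, not a data frame

# apply() requires dim(X); a plain vector has none, so this call fails.
# tryCatch captures the error message instead of stopping the script.
res <- tryCatch(apply(txt, 1, function(x) x),
                error = function(e) conditionMessage(e))
res

# Element-wise functions work on the vector directly.
found <- vapply(txt, function(x) any(words %in% strsplit(x, " ")[[1]]),
                logical(1), USE.NAMES = FALSE)
found
#[1]  TRUE FALSE
```

If the text lives in a vector rather than a data frame column, the element-wise form is the safer choice either way.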

akrun · Answer 2 · 2018-05-24T06:01:44.260


We can paste the elements of the column in the 'A' dataset together and use the result as the pattern in `grepl`, which returns a logical vector by checking the pattern against the strings in the 'B' dataset column.

i1 <- grepl(paste0("\\b(", paste(A$col, collapse="|"), ")\\b"),
      B$col, ignore.case = TRUE)
i1
#[1]  TRUE  TRUE  TRUE FALSE  TRUE

B$col[i1]

data

A <- structure(list(col = c("cat", "dog", "Rat", "Parrot", "Tiger"
)), .Names = "col", class = "data.frame", row.names = c(NA, -5L
))

B <- structure(list(col = c("Give milk to cat", "dog bites", 
  "life span of dog is 10 years", 
 "Cow gives us milk", "Tiger have huge Jaws")), .Names = "col",
 class = "data.frame", row.names = c(NA, -5L))
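The `\\b` word boundaries matter once the text contains words that merely embed a search term. The sketch below contrasts the two patterns from the comments and then checks each word separately, to show which value of A matched in which row. The vector `txt` is a made-up example, not data from the question, and `words` simply repeats `A$col`.

```r
words <- c("cat", "dog", "Rat", "Parrot", "Tiger")
txt <- c("The category is empty", "Give milk to cat", "A separate issue")

# Plain alternation: matches a term anywhere, even inside a longer word
# ("cat" in "category", "rat" in "separate").
loose <- grepl(paste(words, collapse = "|"), txt, ignore.case = TRUE)

# With \\b on both sides, only whole words match.
strict <- grepl(paste0("\\b(", paste(words, collapse = "|"), ")\\b"),
                txt, ignore.case = TRUE)

loose
#[1] TRUE TRUE TRUE
strict
#[1] FALSE  TRUE FALSE

# One column per word: which word from A was found in which string
hits <- sapply(words, function(w) {
  grepl(paste0("\\b", w, "\\b"), txt, ignore.case = TRUE)
})
hits
```

If the words in A could contain regex metacharacters such as `.` or `(`, they would need to be escaped before being pasted into the pattern.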

edited May 24 '18 at 06:01

answered May 24 '18 at 05:47


Akrun, I have changed the data; can you please check with that once? – Murali May 24 '18 at 05:58
@Murali Updated, now the last one will be TRUE – akrun May 24 '18 at 06:01
