My goal is to (1) extract multi-word strings from data1 and (2) replace the extracted strings with the corresponding strings from another dataset (data2). In other words, after extracting the mult1 terms from data1, I want to replace each one with its mult2 counterpart.

library(stringi)
library(stringr)
data1 <- data.frame(id=c(1,2,3), 
          text=c("This is text mining exercise text",
                 "Text analysis is bit confusing analyssi",
                 "Hint on this text analysis?")) 
data2 <- data.frame(mult1 = c("text","analysis","bit confusing"),
          mult2 = c("A; B; C","A; D", "A; B; C; D"))
txt <- subf <- list()
for(i in 1:length(data1$id)){ 
    txt[i] <- str_extract_all(data1$text[i],str_c(data2$mult1,collapse="|")) #this works fine
    subf[i] <- str_replace_all(txt[i],data2$mult2[i]) #here is my problem
}

For instance, txt[1] gives:

[1] "text" "text"

The corresponding string for "text" is "A; B; C" in this case. What I'm looking for is code that produces output like:

"A; B; C" "A; B; C"

Any help is highly appreciated. Thanks!

iGada

1 Answer

I believe that this is a problem of the type described in this famous question.

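txt <- subf <- list()
# Build one lower-cased alternation pattern from all search terms
pattern <- tolower(str_c(data2$mult1, collapse = "|"))
for(i in seq_along(data1$id)){ 
  # Extract every match from the lower-cased text
  txt[[i]] <- unlist(str_extract_all(tolower(data1$text[i]), pattern))
  # Position of each match in mult1 (lower-cased to agree with the pattern)
  j <- match(txt[[i]], tolower(data2$mult1))
  # Look up the corresponding replacement strings
  subf[[i]] <- data2$mult2[j]
}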
subf
#[[1]]
#[1] "A; B; C" "A; B; C"
#
#[[2]]
#[1] "A; B; C"    "A; D"       "A; B; C; D"
#
#[[3]]
#[1] "A; B; C" "A; D"   
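
The same lookup can also be written without an explicit loop. The following is only a sketch in base R (no stringr), assuming the search terms contain no regex metacharacters and that the columns are character rather than factor; `lookup`, `hits` and `subf2` are names introduced here for illustration.

```r
# Same toy data as in the question; stringsAsFactors = FALSE keeps columns as character
data1 <- data.frame(id = c(1, 2, 3),
                    text = c("This is text mining exercise text",
                             "Text analysis is bit confusing analyssi",
                             "Hint on this text analysis?"),
                    stringsAsFactors = FALSE)
data2 <- data.frame(mult1 = c("text", "analysis", "bit confusing"),
                    mult2 = c("A; B; C", "A; D", "A; B; C; D"),
                    stringsAsFactors = FALSE)

# Named lookup vector: names are the lower-cased search terms, values the replacements
lookup <- setNames(data2$mult2, tolower(data2$mult1))
pattern <- paste(names(lookup), collapse = "|")

# Extract all matches per text, then translate each match through the lookup
lower_text <- tolower(data1$text)
hits <- regmatches(lower_text, gregexpr(pattern, lower_text))
subf2 <- lapply(hits, function(h) unname(lookup[h]))
```

Indexing a named vector by the matched strings plays the same role as `match()` in the loop above; a text with no matches simply yields an empty character vector.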
Rui Barradas
  • You nailed it. Big thanks! Is it possible to produce a frequency `table` from `subf[[i]]` after removing the semicolons? – iGada May 20 '20 at 14:55
  • @Gadaa A one-liner `table(trimws(unlist(strsplit(unlist(subf), ';'))))`? – Rui Barradas May 20 '20 at 16:57
  • It gives an error which says `non-character argument`. When I use `as.character`, it produces a different output. I would appreciate it if you could check it again for me. – iGada May 20 '20 at 20:13
  • @Gadaa Can you post the output you are getting? – Rui Barradas May 20 '20 at 21:38
  • Sure! `Error in strsplit(unlist(subf), ";") : non-character argument`. – iGada May 20 '20 at 21:43
  • @Gadaa `strsplit(as.character(unlist(subf)), ";")`. But this becomes more and more complicated and unreadable, it would be better to break the code into 2 or 3 lines. – Rui Barradas May 21 '20 at 01:55
  • I checked the original code you drafted again. The code is conceptually right, but the line `subf[[i]] <- rep(data2$mult2[i], length(txt[[i]]))` considers only the first element of `txt[[i]]`. Would you kindly check it and try to include the other elements of `txt[[i]]`, please? Thanks! – iGada May 21 '20 at 18:10
  • @Gadaa I believe I've got it, see the edit. I have also changed the code to match strings case-insensitive by converting to lower case first. – Rui Barradas May 22 '20 at 10:34
  • Many thanks for being concerned. That's perfectly what I need! – iGada May 22 '20 at 12:17
  • I come to you once again! The value of `j` stays fixed and does not change in each iteration as expected. This affects `subf[[i]] <- data2$mult2[j]`. Would you kindly check it once again, please? – iGada May 24 '20 at 15:51
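
The frequency table discussed in the comments can be split into a few readable steps. This is a minimal sketch, assuming `subf` holds the result shown in the answer (written out by hand here so the code runs on its own); `flat`, `codes` and `freq` are illustrative names. The `non-character argument` error is consistent with factor columns, which `data.frame()` creates by default in R versions before 4.0, so the sketch converts to character first.

```r
# Result of the loop in the answer, written out so the sketch is self-contained
subf <- list(c("A; B; C", "A; B; C"),
             c("A; B; C", "A; D", "A; B; C; D"),
             c("A; B; C", "A; D"))

# Step 1: flatten the list and force character (guards against factor input)
flat <- as.character(unlist(subf))

# Step 2: split on the semicolon and trim the surrounding whitespace
codes <- trimws(unlist(strsplit(flat, ";", fixed = TRUE)))

# Step 3: count how often each code occurs
freq <- table(codes)
freq
```

Creating the data frames with `stringsAsFactors = FALSE` would avoid the factor issue at the source.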