I have a long character vector (around 1800 elements, each a single word; this is `replace`) whose words I want to replace in my texts with the matching elements of a second character vector (`replacewith`: the same number of elements, each being the original word with a tag appended). `texts` is a readtext data frame. I need to keep the texts in the data frame, as I need their tags to export them to .txt files afterwards.

for(i in 1:length (replace)){
  text<-gsub(replace, replacewith, texts, perl=T) 
  }

My loop only replaces the first word in the vector (replace) with the first entry of the second vector (replacewith), but I want the loop to run over the entire vector.

I realised I have not told R to iterate over replace, but how do I tell it to also iterate over replacewith, rather than just using its first entry?

for(i in 1:length (replace)){
  text<-gsub(i, replacewith, texts, perl=T) 
  }

I am putting all of my code below to include a reproducible example:

library(writexl)
library(stringr)
library(tidyverse)
library(readtext)
library(tibble)

texts<-readtext("~/Desktop/taskdesc/texts/*.txt", encoding="UTF-8")
tasks<-readtext::readtext("~/Desktop/taskdesc/tasks/*.docx", encoding="UTF-8")

#I am getting rid of the row names and preparing the replace and replacewith vectors 

replace<-unlist(replace)
replace<-data.frame(replace)
replace<-replace$replace

replacewith<-paste(replace, "<TASK>")

##replace and replacewith look as follows:
 
show (replace) 
[1] about           other           school          great           present         people         

show (replacewith)
[1] "about <TASK>"           "other <TASK>"           "school <TASK>"          "great <TASK>"   

#for loop to replace each of the words from the replace list in the texts with the replacewith

for(i in 1:length (replace)){
  text<-gsub(i, replacewith, texts, perl=T) 
  }

HeHa
  • You seem to be overwriting `text` on every iteration. Also you probably don't need a loop to do this - both `stringi` and `stringr` libraries have a vectorised `gsub()` equivalent which will be much faster. Can you post a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with 5-10 words to illustrate the exact requirements? – SamR Dec 14 '22 at 08:02
  • Thank you so much! I will make sure to try that out! I am not able to share any of the texts, but I will add a reproducible example in the body of the question. – HeHa Dec 14 '22 at 18:31
  • Please give a sample of `text` as well so we can actually run the code. It doesn't have to be your real data. Just make up something that's similar to your real data. 3-5 sample text entries should be plenty. – Gregor Thomas Dec 14 '22 at 19:20

1 Answer

Thanks for updating the question. It would be useful to see texts but I'll assume that it's a character vector of lines, so here are a few:

replace  <- c("about", "other", "school", "great", "present","people")
replacewith<-paste(replace, "<TASK>")

texts  <- c(
    "I went to school",
    "I hate buying presents for other people",
    "What's this about",
    "My mother was here"
)

Note that in my last sentence I have the word "mother". I assume that you do not want "other" to match this. In that case, add word boundaries to your pattern:

replace  <- paste0("\\b", replace, "\\b")

If the assumption is wrong you can skip this step. Then as you have a 1:1 match of patterns:replacements, you can use stringi::stri_replace_all_regex():

stringi::stri_replace_all_regex(
    texts,
    replace,
    replacewith,
    vectorize_all = FALSE
)

# [1] "I went to school <TASK>" "I hate buying presents for other <TASK> people <TASK>" "What's this about <TASK>"
# [4] "My mother was here"
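
For comparison with the loop in the question, the same sequential replacement can be sketched in base R. This is only a minimal sketch, assuming `texts` is a data frame with `doc_id` and `text` columns (the layout `readtext()` returns); the `doc_id` values and the name `patterns` are made up for illustration. The two changes from the original loop are indexing both vectors with `i` and feeding each result back into the next iteration.

```r
# Hypothetical stand-in for the readtext data frame (doc_id + text columns)
texts <- data.frame(
    doc_id = c("a.txt", "b.txt", "c.txt", "d.txt"),
    text = c(
        "I went to school",
        "I hate buying presents for other people",
        "What's this about",
        "My mother was here"
    ),
    stringsAsFactors = FALSE
)

replace <- c("about", "other", "school", "great", "present", "people")
replacewith <- paste(replace, "<TASK>")
patterns <- paste0("\\b", replace, "\\b")  # whole-word patterns (assumed wanted)

# Pair pattern i with replacement i, and update the text column in place
# so that each iteration builds on the result of the previous one.
for (i in seq_along(patterns)) {
    texts$text <- gsub(patterns[i], replacewith[i], texts$text, perl = TRUE)
}

texts$text
```

Because only the `text` column is modified, the other columns of the data frame (such as `doc_id`) stay attached to each text.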

Note that if you skipped adding word boundaries, you can instead use stringi::stri_replace_all_fixed(). It gives the same output as the regex approach without word boundaries, but should be slightly faster.
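
To make the effect of skipping the boundaries concrete, here is a small base R sketch that uses `gsub(fixed = TRUE)` in a loop as a stand-in for fixed-string matching (it is not the stringi call itself, and `fixed_result` is just an illustrative name). With plain patterns, "other" also matches inside "mother" and "present" inside "presents":

```r
replace <- c("about", "other", "school", "great", "present", "people")
replacewith <- paste(replace, "<TASK>")

texts <- c(
    "I hate buying presents for other people",
    "My mother was here"
)

# Literal (non-regex) matching: no word boundaries, so substrings match too
fixed_result <- texts
for (i in seq_along(replace)) {
    fixed_result <- gsub(replace[i], replacewith[i], fixed_result, fixed = TRUE)
}

fixed_result
# [1] "I hate buying present <TASK>s for other <TASK> people <TASK>"
# [2] "My mother <TASK> was here"
```

If tags inside longer words like these are unwanted, the word-boundary regex version is probably the safer choice.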

SamR