Looking for a regular expression to capture occurrences of a pattern and replace each instance with a different value in R

Question

I would like to find every occurrence of a string in a large body of text and replace the nth occurrence of that string with the nth element in an array of replacement strings.

I have a large text file of XML with the url/path of a particular image. This url occurs 1000 times in this file. I have an array of 1000 unique image paths that I would like to substitute into this text file.

The basic idea is:

needle: IM_5sWQ4n0fUWh0jVH

haystack: random XML..src=IM_5sWQ4n0fUWh0jVH...random XML... src=IM_5sWQ4n0fUWh0jVH... random XML... src=IM_5sWQ4n0fUWh0jVH...

Array of image url paths: replaceArray = {IM_5sWQ4n0fUWh0jVH, IM_31DS439u38, IM_8939cSd9321,...}

Goal: Replace first occurrence of IM_5sWQ4n0fUWh0jVH with the first element of replaceArray, replace the second occurrence of IM_5sWQ4n0fUWh0jVH with the second element of replaceArray, etc.

Desired output:
random XML..src=IM_5sWQ4n0fUWh0jVH...random XML... src=IM_31DS439u38... random XML... src=IM_8939cSd9321...

Does anyone have any idea how to go about doing this, preferably in R? I've looked around the web a bit but haven't found an answer so far. Thanks in advance!

score 0 · Accepted Answer · answered Oct 02 '17 at 06:22

You could use sub in a loop. sub replaces only the first match of the pattern in each string, so each pass through the loop consumes the next remaining occurrence. (In general gsub is more useful, since it replaces all matches.)

Replacing Regex Matches in String Vectors

The sub function has three required parameters: a string with the regular expression, a string with the replacement text, and the input vector. sub returns a new vector with the same length as the input vector. If a regex match could be found in a string element, it is replaced with the replacement text. Only the first match in each string element is replaced. If no matches could be found in some strings, those are copied into the result vector unchanged.
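
To make the difference concrete, here is a small illustration with made-up strings (the vector animals is just an example, not part of the question's data):

```r
# Hypothetical input: one element with two matches, one with none.
animals <- c("cat cat", "dog")

# sub replaces only the first match in each element.
first_only <- sub("cat", "mouse", animals)

# gsub replaces every match in each element.
all_matches <- gsub("cat", "mouse", animals)

print(first_only)   # "mouse cat" "dog"
print(all_matches)  # "mouse mouse" "dog"
```

The element without a match ("dog") comes back unchanged in both cases.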

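# Note: despite its name, df is a plain character vector, not a data frame.
df <- c("16_24cat 16_24cat", "25_34cat34343", "35_44cats 16_24cat33 16_24cat", "45_54Cat 16_24cat", "55_104fat")
ar <- c("mouse", "bear", "duck")
# Each pass replaces the first remaining "cat" (case-insensitive) in every
# element with the next replacement value. Looping over seq_along(ar) avoids
# indexing past the end of ar.
for (i in seq_along(ar)) {
  df <- sub(pattern = "cat", replacement = ar[i], x = df, ignore.case = TRUE, perl = TRUE)
}
df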

Output:

"16_24mouse 16_24bear"
"25_34mouse34343"
"35_44mouses 16_24bear33 16_24duck"
"45_54mouse 16_24bear"
"55_104fat"

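One caveat with the loop: if a replacement value contains the needle itself (as in the question, where the first element of replaceArray equals the needle), the next pass will match that same spot again and the order gets scrambled. A possible alternative, sketched here under the assumption that the whole file has been read into a single string, is to locate all matches once with gregexpr and assign the replacements in one step with regmatches<-. The variable names are illustrative and the haystack is the toy example from the question, not real XML.

```r
needle <- "IM_5sWQ4n0fUWh0jVH"

# Toy haystack from the question (assumption: the real file is one string).
haystack <- paste0(
  "random XML..src=", needle, "...random XML... src=", needle,
  "... random XML... src=", needle, "..."
)
replacements <- c("IM_5sWQ4n0fUWh0jVH", "IM_31DS439u38", "IM_8939cSd9321")

# Locate every literal occurrence of the needle (fixed = TRUE: no regex).
matches <- gregexpr(needle, haystack, fixed = TRUE)

# Guard against a mismatch between occurrences and replacements.
stopifnot(length(regmatches(haystack, matches)[[1]]) == length(replacements))

# Assign the nth replacement to the nth match, all in one step.
result <- haystack
regmatches(result, matches) <- list(replacements)
print(result)
```

Because the match positions are computed once, before anything is replaced, it does not matter whether a replacement looks like the needle. For a real file, something like paste(readLines(path), collapse = "\n") would give the single string, where path is a placeholder for your file name.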