Remove multiple instances with a regex expression, but not the text in between instances

Question

In long passages written with bookdown, I have inserted numerous images. Having combined the passages into a single character string (in a data frame), I want to remove the markdown that inserts the images, but not any text in between those images. Here is a toy example.

library(stringr)

text.string <- "writing ![Stairway scene](/media/ClothesFairLady.jpg) writing to keep ![Second scene](/media/attire.jpg) more writing"

str_remove_all(string = text.string, pattern = "!\\[.+\\)")
[1] "writing  more writing"

The regex doesn't stop at the first closing parenthesis; it continues until the last one and deletes the "writing to keep" in between.

I tried to apply the answers to "String manipulation in R: remove specific pattern in multiple places without removing text in between instances of the pattern", which use gsubfn and gsub, but was unable to get those solutions to work.

Please point me in the right direction to solve this problem of a regex removal of designated strings, but not the characters in between the strings. I would prefer a stringr solution, but whatever works. Thank you

Another option is `str_replace_all(text.string, "!.*?\\)", "")`, because you really only need to look for what you don't want since that seems to follow a solid pattern. That does leave you with double whitespace between substrings just like your example; this will make it a single whitespace: `str_replace_all(text.string, "!.*?\\) ", "")` — Kat, Sep 02 '21 at 16:10

score 2 · Accepted Answer · answered Sep 02 '21 at 15:37

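You can use the following regex:

"!\\[[^\\)]+\\)"

Alternatively, you can use this:

"!\\[.*?\\)"

Both patterns stop at the first closing parenthesis instead of running on to the last one, which is the key to your question. The first does so with a negated character class, which cannot match a ")"; the second uses a lazy quantifier rather than a greedy one.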
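
To make the difference concrete, here is a minimal sketch comparing the greedy pattern from the question with the two patterns above. It uses base R `gsub()` with `perl = TRUE` so it runs without extra packages (that choice is mine, not the answer's); the same patterns can be passed to `str_remove_all()`. The variable names `greedy`, `negated` and `lazy` are just for illustration.

```r
text.string <- "writing ![Stairway scene](/media/ClothesFairLady.jpg) writing to keep ![Second scene](/media/attire.jpg) more writing"

# Greedy: .+ runs on to the LAST ")" and swallows "writing to keep"
greedy <- gsub("!\\[.+\\)", "", text.string, perl = TRUE)

# Negated class: [^)]+ cannot cross a ")", so each match ends at the first one
negated <- gsub("!\\[[^)]+\\)", "", text.string, perl = TRUE)

# Lazy quantifier: .*? expands only as far as needed to reach the next ")"
lazy <- gsub("!\\[.*?\\)", "", text.string, perl = TRUE)

print(greedy)   # "writing  more writing"
print(negated)  # "writing  writing to keep  more writing"
print(lazy)     # "writing  writing to keep  more writing"
```

On this toy string the negated class and the lazy quantifier give the same result; only the greedy version loses the text between the images.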

koolmees

It would be useful to expand on how the character class keeps the match short in the first case, as against `*?` in the second, if I read the distinction correctly. – Chris Sep 02 '21 at 16:23
Works perfectly. Lazy stops at the first instance, whereas greedy gobbles on and on. The .*? could thus be read as "remove all characters until you reach the first )"? – lawyeR Sep 02 '21 at 17:20
@lawyeR yes, the .*? will, from left to right, keep adding characters to its match until it has found a ")". Afterwards the search resumes at the next "![" and the process repeats. Lazy matches can be expensive though, hence I also gave the first option (negated class). The negated class solution simply matches all characters that are not a ")" and keeps everything from ")" to the next "![", making it a more efficient solution if you have a lot of data – koolmees Sep 03 '21 at 07:53
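
As a rough illustration of the comment above, the sketch below lists what each pattern actually matches and then applies the negated-class pattern to a data frame column. `show_matches()`, `df`, `passage` and `clean` are hypothetical names introduced here, and base R with `perl = TRUE` is my assumption; nothing is timed, so the efficiency claim itself is not tested.

```r
text.string <- "writing ![Stairway scene](/media/ClothesFairLady.jpg) writing to keep ![Second scene](/media/attire.jpg) more writing"

# Hypothetical helper (not from the thread): return every substring a pattern matches
show_matches <- function(pattern, x) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

negated_hits <- show_matches("!\\[[^)]+\\)", text.string)
lazy_hits    <- show_matches("!\\[.*?\\)", text.string)

# Both patterns pick out the same two image tags, each ending at its own ")"
print(negated_hits)
print(lazy_hits)

# gsub() is vectorised, so the same pattern cleans a whole column at once
df <- data.frame(passage = c(text.string, "no images here"),
                 stringsAsFactors = FALSE)
df$clean <- gsub("!\\[[^)]+\\)", "", df$passage, perl = TRUE)
print(df$clean)
```

Rows without an image tag pass through unchanged, so the replacement can be run over the whole column without filtering first.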

Anoushiravan R · Answer 2 · 2021-09-03T11:15:57.397

1

I think you could use the following solution too:

gsub("!\\[[^][]*\\]\\([^()]*\\)", "", text.string)

[1] "writing  writing to keep  more writing"
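
The result above still has doubled spaces where the images used to be. A possible follow-up, sketched here with base R only (the names `cleaned` and `squished` are illustrative), is to collapse runs of whitespace afterwards; in stringr, `str_squish()` does the same job.

```r
text.string <- "writing ![Stairway scene](/media/ClothesFairLady.jpg) writing to keep ![Second scene](/media/attire.jpg) more writing"

# Remove each ![alt](path) tag: alt text has no brackets, path has no parentheses
cleaned <- gsub("!\\[[^][]*\\]\\([^()]*\\)", "", text.string)

# Collapse runs of whitespace to one space and trim both ends
squished <- trimws(gsub("\\s+", " ", cleaned))

print(cleaned)   # "writing  writing to keep  more writing"
print(squished)  # "writing writing to keep more writing"
```

Because this pattern requires the full `![...](...)` shape, a stray "!" or "![" that is not followed by a link is left alone.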

edited Sep 03 '21 at 11:15

answered Sep 02 '21 at 20:05

