
I happened to be looking at this question, and I chanced upon this string:

#2335, IFCRELASSOCIATESMATERIAL, '2ON6$yXXD1GAAH8whbdZmc', #5,$,$, [#40,#221,#268,#281],#2334

I wanted to replace only the commas (,) inside the substring [#40,#221,#268,#281] with underscores (_). I tried this in R with the stringr package, using str_replace() as follows:

  • First locate the substring in the parent string with lookarounds: (?<=\\[).+(?=\\[). (I am doubling the backslashes because the pattern is written as an R string literal, which is what stringr receives.)
  • Then match all instances of only the commas within the substring with [^0-9#]+. So now the regex would be (?<=\\[)[^0-9#]+(?=\\[).
  • Now use str_replace() to replace the above matches with _ as follows: str_replace(mystring, "(?<=\\[)[^0-9#]+(?=\\[)", "_")
  • where mystring contains the string #2335, IFCRELASSOCIATESMATERIAL, '2ON6$yXXD1GAAH8whbdZmc', #5,$,$, [#40,#221,#268,#281],#2334
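For reference, the steps above can be reproduced with the following minimal sketch. It assumes the stringr package is installed; `pattern` and `result` are illustrative variable names that are not part of the original attempt.

```r
library(stringr)

# The input string from the question.
mystring <- "#2335, IFCRELASSOCIATESMATERIAL, '2ON6$yXXD1GAAH8whbdZmc', #5,$,$, [#40,#221,#268,#281],#2334"

# The pattern built in the steps above: a run of characters that are
# neither digits nor '#', preceded by '[' and followed by '['.
pattern <- "(?<=\\[)[^0-9#]+(?=\\[)"

# str_replace() replaces only the first match, if there is one.
result <- str_replace(mystring, pattern, "_")

# TRUE here means the pattern matched nothing and the string is unchanged.
identical(result, mystring)
```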

I expected this regex to mean: replace one or more characters that are not digits or # within the bounds of [ and ] with the character _. Evidently it does not, as my attempt did not work.

Where am I going wrong, and what is the right way to solve regex problems of this kind?

tl;dr: how does one extract all tokens but a certain token (or set of tokens) from a substring bounded by two other arbitrary tokens?

Dunois
  • `str_replace_all(string, "\\[[^\\]\\[]+]", function(x) gsub(",", "_", x, fixed=TRUE))` – Wiktor Stribiżew Feb 28 '20 at 00:04
  • I always forget that functions can be passed as arguments. The `gsub()` solution is neat, and it works with the lookarounds too!! – Dunois Feb 28 '20 at 00:17
  • Or if you are a bit more adventurous, using the Continue operator `\G` and smart use of `\K`: `(?:\[#\d+\K,|(?<!^)\G,)(#\d+)` https://regex101.com/r/ZQpgGa/1 – wp78de Feb 28 '20 at 01:03
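The `str_replace_all()` suggestion from the first comment can be tried out with the sketch below. It assumes stringr 1.4.0 or later, where `replacement` may be a function, and it escapes the closing `]` in the pattern as a precaution, which differs slightly from the comment as posted. `mystring` and `result` are illustrative names.

```r
library(stringr)

# The input string from the question.
mystring <- "#2335, IFCRELASSOCIATESMATERIAL, '2ON6$yXXD1GAAH8whbdZmc', #5,$,$, [#40,#221,#268,#281],#2334"

# Step 1: match each whole bracketed group (no nested brackets allowed).
# Step 2: inside each matched group only, swap commas for underscores.
result <- str_replace_all(
  mystring,
  "\\[[^\\]\\[]+\\]",
  function(x) gsub(",", "_", x, fixed = TRUE)
)

result
```

The idea is to split the problem in two: one regex finds the bounded substring, and a second, much simpler replacement runs only on that match, so commas outside the brackets are left alone.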
