I'm trying to use Snowflake's regex implementation, which I have just discovered is POSIX BRE/ERE. I had previously written a regular expression that identifies every comma outside the double-quoted sections of a string and replaces it with a custom delimiter (for text file parsing).

Sample text string:

"Foreign Corporate Name Registration","99999","Valuation Research",,"Active Name",02/09/2020,"02/09/2020","NEVADA","UNITED STATES",,,"123 SOME STREET",,"MILWAUKEE","WI","53202","UNITED STATES","123 SOME STREET",,"MILWAUKEE","WI","53202","UNITED STATES",,,,,,,,,,,,

Regex pattern and substitution (working on regex101.com):

([("].*?["])*?(,)
\1#^#

Regex101.com (and desired) result:

"Foreign Corporate Name Registration"#^#"99999"#^#"Valuation Research"#^##^#"Active Name"#^#02/09/2020#^#"02/09/2020"#^#"NEVADA"#^#"UNITED STATES"#^##^##^#"123 SOME STREET"#^##^#"MILWAUKEE"#^#"WI"#^#"53202"#^#"UNITED STATES"#^#"123 SOME STREET"#^##^#"MILWAUKEE"#^#"WI"#^#"53202"#^#"UNITED STATES"#^##^##^##^##^##^##^##^##^##^##^##^#

So, given that I have now belatedly discovered that I cannot use lazy quantifiers, can anyone advise how I might alter my expression to return the same result while complying with POSIX BRE/ERE?

Wiktor Stribiżew
CaseyR
  • Did you try `("[^"]*")*,`? – Wiktor Stribiżew Sep 02 '20 at 17:14
  • @WiktorStribiżew - I did not! With a small modification, `("[^"]*")*(,)`, that works perfectly! Sir, thank you very much!! And I can't work out how to give you credit for it, I assume because it's a comment - sorry :( – CaseyR Sep 03 '20 at 08:28
  • But why are you capturing the comma? You are not using the second group, you have `\1#^#` in the replacement. – Wiktor Stribiżew Sep 03 '20 at 08:29
  • The comma is actually the character being replaced; my (weak) understanding is that the first group is negating text within the quotes. With your regex, I get: `"Foreign Corporate Name Registration"#^##^#,"99999"#^##^#,"Valua...`; with the addition of the second group, I get the desired: `"Foreign Corporate Name Registration"#^#"99999"#^#"Valua...` – CaseyR Sep 03 '20 at 08:39
  • No, the group saves the captured text in a separate memory buffer and backreferences like `\1`, `\2`, etc. are sheer placeholders for those matches. – Wiktor Stribiżew Sep 03 '20 at 08:40

1 Answer

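You need to

  • Convert the lazy quantifiers into greedy ones, since POSIX BRE/ERE has no lazy quantifiers. Once .*? is rewritten as the negated character class [^"]*, the greedy version matches the same text here as the lazy one did
  • Replace [("] with a plain ". The character class [("] matches either ( or ", but you only need to match "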
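As a rough check of the first point, the sketch below compares the original lazy pattern with the greedy pattern this answer arrives at. It uses Python's `re` module only because that engine accepts both forms; it is not a POSIX engine and Snowflake is not involved. The sample line is made up for illustration (it adds a comma inside quotes, which the question's sample does not have), and the empty-string substitution for an unmatched group 1 assumes Python 3.5 or later.

```python
import re

# Hypothetical sample line, not from the question; note the comma inside quotes.
sample = '"Valuation Research, Inc.",,"Active Name",02/09/2020,"NEVADA",,'

lazy_pattern = r'([("].*?["])*?(,)'   # original pattern, relies on lazy quantifiers
greedy_pattern = r'("[^"]*")*(,)'     # greedy pattern using only ERE-style syntax

# \1 re-inserts the quoted field (if any) ahead of the new delimiter.
# When group 1 did not take part in the match, Python 3.5+ substitutes "".
lazy_result = re.sub(lazy_pattern, r'\1#^#', sample)
greedy_result = re.sub(greedy_pattern, r'\1#^#', sample)

print(greedy_result)
print(lazy_result == greedy_result)
```

On this well-formed input both patterns should give the same output; that is not a guarantee for malformed input such as unbalanced quotes.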

The final POSIX ERE expression will look like

("[^"]*")*(,)

It matches

  • ("[^"]*")* - zero or more occurrences of a ", followed by zero or more chars other than ", and then a " (Group 1)
  • (,) - a comma (Group 2)

NOTE: The POSIX BRE expression will look like \("[^"]*"\)*\(,\), where each capturing group is defined with a pair of escaped parentheses.
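To see what Group 1 holds on each match, here is a minimal sketch, again in Python's `re` rather than a POSIX engine. The helper name `swap_delimiter` and the test line are hypothetical. It rebuilds the replacement by hand (quoted field, if any, plus the delimiter), which is what `\1#^#` amounts to, and shows that capturing the comma makes no difference because only Group 1 is reused.

```python
import re

ERE_PATTERN = r'("[^"]*")*(,)'    # the expression given in this answer
NO_COMMA_GROUP = r'("[^"]*")*,'   # same expression without capturing the comma


def swap_delimiter(line, pattern=ERE_PATTERN, delimiter="#^#"):
    """Replace each matched comma with `delimiter`, keeping any quoted field."""
    def repl(match):
        # group(1) is None when no quoted field precedes the comma.
        return (match.group(1) or "") + delimiter
    return re.sub(pattern, repl, line)


# Hypothetical line: empty quoted field, unquoted field, comma inside quotes.
line = '"WI","",02/09/2020,,"123 SOME STREET, APT 4",'
result = swap_delimiter(line)
same = swap_delimiter(line, NO_COMMA_GROUP)

print(result)
print(result == same)
```

One limitation worth knowing: the pattern relies on each quoted field being followed directly by a comma. If the last field on a line is quoted, contains a comma, and has no trailing comma, that inner comma gets replaced too. The question's sample line ends in commas, so it is not affected.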

Wiktor Stribiżew
  • Great explanation, time for me to head to RegEx101 - thank you Wiktor! – CaseyR Sep 03 '20 at 08:44
  • @CaseyR You should watch out for incompatibility between all regex flavors supported at regex101.com and POSIX BRE/ERE. Also, see [this thread](https://stackoverflow.com/questions/18514135/bash-regular-expression-cant-seem-to-match-any-of-s-s-d-d-w-w-etc). – Wiktor Stribiżew Sep 03 '20 at 08:51
  • I'll happily upvote your answer, but your phrasing is weird. Surely if the *question* is good it could still have answers which deserve downvotes? Though of course that's not the case here really. – tripleee Sep 09 '20 at 07:48