0

I want to convert text like this using Snowflake SQL:

A very(!) punctuated sentence! A complicated/confusing regex.

to this:

'A very (  !  )  punctuated sentence !  A complicated / confusing regex . '

Double spaces between punctuation is ok because I can do a second pass to compress whitespace. The punctuation list is

.,&-_()[]{}-/:;%$@!*|?~=+"\'

But if there is a standard shortcut for all punctuation I would consider that. I have seen answers that use Java regex that uses \p{Punct}. But in my tests I can't use the punctuation identifier and don't see it in the Snowflake docs.

I have a working version that makes my eyes bleed and it's not even fully written out:

select regexp_replace(
  'a very(!) punctuated sentence! A complicated/confusing regex?.', 
  '(\\(|\\)|\\/|!|\\?)', -- only addresses (), /, !, ?, not the full list
  ' \\1 '
) as "result" from table

result: "a very ( ! ) punctuated sentence ! A complicated / confusing regex ? ." 

For some reason there are not double spaces, which makes me question the result as well as the readability of the implementation

My understanding is that character classes are more performant and simpler to visually parse. But this doesn't work:

select regexp_replace(
  'a very(!) punctuated sentence! A complicated/confusing regex?.', 
  '[.,&-_()[]{}-/:;%$@!*|?~=+"\'`]', 
  ' \\1 '
) as "result" from table
-- Error: no argument for repetition operator: ?

It also doesn't seem that back references are available to character classes.

Is there a way to write this query that is relatively performant and allows the reader to easily visually parse the punctuation list such as in the character classes above?

Jeff
  • 939
  • 10
  • 19
  • `[.,&-_()[]{}-/:;%$@!*|?~=+"\'`]` The hyphen (`-`) should appear at the beginning plus it appears twice and should only appear once, shouldn't it? – Abra Aug 02 '20 at 04:17
  • Yes, for some reason I have two hyphens in there. But why would placing it at the beginning matter? – Jeff Aug 02 '20 at 13:49

2 Answers2

1

I see two potential problems in your current approach. First, the hyphen should appear last in the character class, or else it should be escaped. Currently, your character class has &-_, which means include every character in between & and _, probably not what you intended. Second, your regex doesn't actually have a first capture group. You could try replacing with \0, or just put the punctuation character into a first capture group and then use \1 as you already were doing.

SELECT REGEXP_REPLACE(
    'a very(!) punctuated sentence! A complicated/confusing regex?.',
    '([.,&_()[]{}-/:;%$@!*|?~=+"\'`-])', 
    ' \\1 '
) AS "result"
FROM yourTable;
Tim Biegeleisen
  • 502,043
  • 27
  • 286
  • 360
  • 1
    A quote from the book _Mastering Regular Expressions_, 3rd edition: _The dashes are also not class metacharacters in this case because each is the first thing after [ or [^_ Doesn't that mean that a dash (i.e. hyphen) needs to be at the start of a character class and not at the end in order to be considered as a literal and not as a meta-character? – Abra Aug 02 '20 at 14:43
  • 1
    @AbraCadabra It might depend on the regex engine. Rule of thumb: escape dash or place it at the very end of the character class. – Tim Biegeleisen Aug 02 '20 at 14:50
  • I read up more on why to put it at the end of the character class. Thank you @Abra and Tim – Jeff Aug 02 '20 at 15:08
  • @TimBiegeleisen I got an error `Invalid regular expression: `[.,&_()[]{}-/:;%$@!*|?~=+"'\`-]', no argument for repetition operator: ?` when trying that. But that error and your explanation led me to the right answer which I'll post. Thank you – Jeff Aug 02 '20 at 15:11
  • @Jeff It might be better if you instead _edit_ my answer and correct it. – Tim Biegeleisen Aug 02 '20 at 15:16
0

This solution works great and is very readable:

select regexp_replace(
  'a very(!) punctuated sentence! A complicated/confusing regex?.',
  '[[:punct:]]',
  ' \\0 '
)

I got some errors when trying Tim's answer no argument for repetition operator: ? that got me to Snowflakes documentation of using POSIX basic and extended usage where [:punct:] is a valid character class. That character class covers all of the punctuation I had before, plus <>^# which works for my purposes.

Thank you Tim and Abra

Jeff
  • 939
  • 10
  • 19