0

I would like to highlight SQL keywords that occur within a string in a syntax highlighter. Here are the rules I would like to have:

  • Match the keywords SELECT and FROM (others will be added, but we'll start here). Must be all-caps
  • Must be contained in a string -- either starting with ' or "
  • The first word in that string (ignoring whitespace preceding it) should be one of the keywords.

This of course is not comprehensive (can ignore escapes within a string), but I'd like to start here.

Here are a few examples:

  • SELECT * FROM main -- will not match (not in a string)
  • "SELECT name FROM main" -- will match
  • "
    SELECT name FROM main" -- will match
  • """Here is a SQL statement:

SELECT * FROM main""" -- no, string does not start with a keyword (SELECT...).

The only way I thought to do it in a single regex would be with a negative lookbehind...but then it would not be fixed width, as we don't know when the string starts. Something like:

But this of course won't work:

enter image description here

Would something like this be possible to do in a single regex?

shA.t
  • 16,580
  • 5
  • 54
  • 111
David542
  • 104,438
  • 178
  • 489
  • 842
  • [This](https://regex101.com/r/RjrqlR/1)? Or [this](http://regexstorm.net/tester?p=%28%3f%3c%3d%5b%22%27%5d%5cs*%28SELECT%29%5cs*%5c*%5cs*%29%28%3f%3aSELECT%7cFROM%29&i=%22SELECT+*+FROM+table%22%0d%0a%22%0d%0a+++++SELECT+*+FROM+table%22%0d%0a%0d%0a%22%22%22%0d%0aUse+this+query+here%3a+%22SELECT+*+FROM+table%22%0d%0a%22%22%22%0d%0a%0d%0aSELECT+*+FROM+table&o=m)? – Wiktor Stribiżew May 24 '20 at 22:45
  • @WiktorStribiżew yea, but only two should match, not all four (i.e., it's within a valid python string). – David542 May 24 '20 at 22:47
  • 3
    Not sure how you are going to do this without a proper parser. [This regex](https://regex101.com/r/gt5BqE/1) could help to some extent, but you would need to check groups that matched. – Wiktor Stribiżew May 24 '20 at 23:24
  • @WiktorStribiżew I see, yes that makes sense, thanks for the feedback on this. – David542 May 25 '20 at 05:30
  • `(?<=["']\s*(SELECT)\s*)(SELECT|FROM)` contains as variable width look behind, what regex engine is intended for ? –  Jun 03 '20 at 23:34

5 Answers5

3

A suitable regular expression is likely to get pretty complex, especially as the rules evolve further. As others have noted, it may be worth considering using a parser instead. That said, here is one possible regex attempting to cover the rules mentioned so far:

(["'])\s*(SELECT)(?:\s+.*)?\s+(FROM)(?:\s+.*)?\1(?:[^\w]|$)

Regular expression visualization

Online Demos

  1. Debuggex Demo
  2. Regex101 Demo

Explanation

As can be seen in the above visualisation, the regex looks for either a double or single quote at the start (saved in capturing group #1) and then matches this reference at the end via \1. The SELECT and FROM keywords are captured in capturing groups #2 and #3. (The (?:x|y) syntax ensures there aren't more groups for other choices as ?: at the start of a choice excludes it as a capturing group.) There are some further optional details such as limiting what is allowed between the SELECT and FROM and not counting the final quotation mark if it is immediately succeeded by a word character.

Results

SELECT * FROM tbl        -- no match - not in a string
"SELECT * FROM tbl"      -- matches - in a double-quoted string
'SELECT * FROM tbl;'     -- matches - in a single-quoted string
'SELECT * FROM it's      -- no match - letter after end quote
"SELECT * FROM tbl'      -- no match - quotation marks don't match
'SELECT * FROM tbl"      -- no match - quotation marks don't match
"select * from tbl"      -- no match - keywords not upper case
'Select * From tbl'      -- no match - still not all upper case
"SELECT col1 FROM"       -- matches - even though no table name
'  SELECT  col1  FROM '  -- matches - as above with more whitespace
'SELECT col1, col2 FROM' -- matches - with multiple columns

Possible Improvement?

It might also be necessary to exclude quotation marks from the "any character" parts. This can be done at the expense of increased complexity using the technique described here by replacing both instances of .* with (?:(?!\1).)*:

(["'])\s*(SELECT)(?:\s+(?:(?!\1).)*)?\s+(FROM)(?:\s+(?:(?!\1).)*)?\1(?:[^\w]|$)

See this Regex101 Demo.

Steve Chambers
  • 37,270
  • 24
  • 156
  • 208
0

You could use capturing groups:

(.*["']\s*\K)(?(1)(SELECT|FROM).*(SELECT|FROM)|)

In this case $2 would refer to the first keyword and $3 would refer to the second keyword. This also only works if there are only two keywords and only one string on a line, which seems to be true in all of your examples, but if those restrictions don't work for you, let me know.

Charlie Armstrong
  • 2,332
  • 3
  • 13
  • 25
0

Just tested the regexp bellow:

enter image description here

If you need to add other commands the thing may get a little trick, because some keywords doesn't apply. Eg: ALTER TABLE mytable or UPDATE SET col = val;. For these scenarios you will need to create subgroups and the regexp may become slow.

Best regards!

Fausto Alonso
  • 1,046
  • 6
  • 8
0

If I understand your requirements well I suggest that:

/^'\s*(SELECT)[^']*(FROM)[^']*'|^"\s*(SELECT)[^"]*(FROM)[^"]*"/m

[Regex Fiddle Demo]

Explanation:

  • When you need to check start of a string; use ^.
  • When you need to accept 0-n spaces; use \s*.
  • When you need to accept new-line or multi-line strings; use m flag over your regex.
  • When you need to use Case-Sensitive mode; Don't use i flag over your regex.
  • When you need to block a string between a specific character like "; use [^"]* instead of .* that will protects first end of block.
  • When you need to have a block with similar start and end characters like ' & "; use ' '|" " instead of ['"] ['"].

Update:
If you need to capture any special keyword after verifying existence of SELECT keyword after start of your string, I can update my solution to this:

/^'\s*(SELECT)([^']*(SELECT|FROM))+|^"\s*(SELECT)([^"]*(SELECT|FROM))+/m
shA.t
  • 16,580
  • 5
  • 54
  • 111
0

without parsing of quoted strings
could be done using \G and \K construct

(?:"\s*(?=(?:SELECT|FROM))|(?<!^)\G)[^"]*?\K(SELECT|FROM)

demo