1

I'm working on a piece of Python code using regular expressions and I'm trying to achieve the following:

If I have a piece of SQL as a string that contains some code in [] and a WHERE clause in front of it, I want to remove the complete WHERE clause. For example:

where this condition and [this = 1] group by 1,2,3

becomes

group by 1,2,3

The code I'm using is:

txt = """where [this = 1] group by"""

txt = re.sub("where.*\[.*\].*group" , 'group', txt, flags = re.S|re.I)    

However, if I have another WHERE clause before this one the whole regex doesn't work as expected, e.g.:

txt = """where that and that do that where [this = 1] group by 1,2,3"""

txt = re.sub("where.*\[.*\].*group" , 'group', txt, flags = re.S|re.I)  

produces

group by 1,2,3

instead of

where that and that do that group by 1,2,3

Edit: the solution should also work for a scenario like this:

txt = """where that and that do that where [this = 1] and [that = 1] group by 1,2,3"""

outputs:

"""where that and that do that group by 1,2,3"""

So it removes the inner (closest to []) WHERE clause and all code containing at least one [] until the next GROUP, ORDER or end of string.

Finally, the solution needs to takes care of cases in which there are multiple such where .. [...] snippets in the string.

txt = """where [Include_Fakes|TRUE] group by 1 order by 1,3 ) where signed_up_date >= dateadd('[aggregation]', -[Last_N_Periods|12], CURRENT_DATE)"""

expected output:
group by 1 order by 1,3 ) 

Can somebody point me in the right direction here? Any help would be appreciated.

Roy2012
  • 11,755
  • 2
  • 22
  • 35
john_jerome
  • 276
  • 1
  • 2
  • 8

3 Answers3

3

You can use negative lookaheads to find the last possible match:

>>> import re
>>> re.sub(r"where((?!where).)*?]\s?", "", "where that and that do that where [this = 1] group by 1,2,3")
'where that and that do that group by 1,2,3'
>>> re.sub(r"where((?!where).)*?]\s?", "", "where this condition and [this = 1] group by 1,2,3")
'group by 1,2,3'

Demo

L3viathan
  • 26,748
  • 2
  • 58
  • 81
  • this wouldn't work if I had multiple `[]`, i.e. `re.sub(r"where((?!where).)*?]\s?", "", "where this condition and [this = 1] and [that = 1] group by 1,2,3")` doesn't produce the desired output – john_jerome Jul 03 '20 at 08:45
  • If you want to greedily match until the last `]`, just change the `*?` into a `*`. – L3viathan Jul 03 '20 at 09:46
1

Here's a way to do that.

exp =r"(where((?!where).)*\[.*?\].*?(?=(group|order)))|(where((?!where).)*\[.*?\].*$)"

txt = """where that and that do that where [this = 1] and [that = 1] group by 1,2,3"""
print(re.sub(exp, "", txt))
# ==> where that and that do that group by 1,2,3

txt = """where that and that do that where [this = 1] group by 1,2,3"""
print(re.sub(exp, "", txt))
# ==> where that and that do that group by 1,2,3

txt = """lots of code where signed_up_date >= dateadd('[aggregation]', -[Last_N_Periods|12], CURRENT_DATE)""" 
print(re.sub(exp, "", txt))
# ==> lots of code 

txt = """where [Include_Fakes|TRUE] group by 1 order by 1,3 ) where signed_up_date >= dateadd('[aggregation]', -[Last_N_Periods|12], CURRENT_DATE)"""
print(re.sub(exp, "", txt))

# ==> group by 1 order by 1,3 ) 

txt =  """where [condition1] group by 1) where [condition2] group by 2"""
print(re.sub(exp, "", txt))

# ==> group by 1) group by 2
Roy2012
  • 11,755
  • 2
  • 22
  • 35
  • this wouldn't work if I had multiple `[]`, i.e. `re.sub(r"where((?!where).)*?]\s?", "", "where this condition and [this = 1] and [that = 1] group by 1,2,3")` doesn't produce the desired output – john_jerome Jul 03 '20 at 08:47
  • 2
    It would be great if the question would describe this scenario, with an example and expected output. – Roy2012 Jul 03 '20 at 08:50
  • this seems almost right - however, it somehow doesn't work in this case `real_code = """lots of code where signed_up_date >= dateadd('[aggregation]', -[Last_N_Periods|12], CURRENT_DATE)""" reg = re.sub(r"where\s+\[.*\]\s.*", "", real_code, flags = re.S|re.I)` – john_jerome Jul 03 '20 at 09:31
  • What would you like to see in this case? The bracket doesn't come immediately after the "where" statement. – Roy2012 Jul 03 '20 at 09:38
  • "lots of code" would be the desired output. In my original question by 'in front of it' I didn't imply that 'WHERE' should be immediately in front of an expression '[]'. Sorry if that was not clear. – john_jerome Jul 03 '20 at 09:44
  • Some of the code you want to remove in this case is not in brackets. Is that on purpose? – Roy2012 Jul 03 '20 at 09:48
  • Yes. More generally, if a 'WHERE' clause contains at least one expression in brackets, I want to remove the complete 'WHERE' clause. In my case, 'WHERE' clause is limited by 'GROUP', 'ORDER' or end of string. That's what I'm aiming for. – john_jerome Jul 03 '20 at 09:50
  • So ... to make sure I get it right. The part you'd like to remove (1) begins with a 'where'; (2) contains some text something in brackets. (3) ends with group by / order by / the end of the string. Correct? And a question - what happens if you have 'group by' within the brackets? – Roy2012 Jul 03 '20 at 09:52
  • Yes, that is correct. 'group by' can't be inside the squared brackets as '[ ]' can only contain a filter which will be inside a where clause. Another example: `some code where field1 = 'hello' and [created_at::DATE=daterange] group by 1 some more code` becomes `some code group by 1 some more code` – john_jerome Jul 03 '20 at 09:57
  • I think I just found a counter-example for when this regex doesn't work as expected: ""where [Include_Fakes|TRUE] group by 1 order by 1,3 ) where signed_up_date >= dateadd('[aggregation]', -[Last_N_Periods|12], CURRENT_DATE)"" "should end up with group by 1 order by 1,3" but it produces an empty string - so it seems that the double appearance of "where ... []" causes an issue – john_jerome Jul 03 '20 at 14:42
  • Looking at it. This means that you can have multiple instances of where followed by brackets, each ending with group / order / end of expression. Right? – Roy2012 Jul 04 '20 at 05:51
  • I believe your last version only works if the second occurrence of "where [condition]" is followed by end of string expression. If you simply add 'group by' at the end of the last "txt" it returns 'group by' instead of "group by 1 order by 1,3 ) group by", so the regex basically skips the inner clause. At this point I'm not really sure that I understand how to fix this... – john_jerome Jul 06 '20 at 06:23
  • similar happens in this case: real_code = """WHERE text where [aggregation] group by 1 order by 1 desc""" print(re.sub(exp, r"", real_code, flags = re.S|re.I)) >> group by 1 order by 1 desc – john_jerome Jul 06 '20 at 06:25
  • Hmm. I'll look into it. – Roy2012 Jul 06 '20 at 06:55
  • As to the last one, it seems to be fine. There are TWO where expressions. One with brackets, which is removed, and the other without - and this one stays there. – Roy2012 Jul 06 '20 at 06:58
  • you're right. But the regex removed BOTH 'where' while it should only touch the one that is followed by [aggregation], correct? so "where text where [aggregation] group by 1 order by 1 desc" becomes "where text group by 1 order by 1 desc" – john_jerome Jul 06 '20 at 09:04
  • did you have a chance to look into it? I'm kinda stuck why it doesn't remove consequent occurrences of 'where [filter]'... – john_jerome Jul 07 '20 at 06:32
  • This is the output I'm seeing: 'where text group by 1 order by 1 desc'. What's wrong with it? – Roy2012 Jul 07 '20 at 06:35
  • Sorry, I was referring to my first comment from yesterday: txt = """where [condition1] group by 1) where [condition2] group by 2""" returns group by 2 instead of "group by 1) group by 2", so it "swallows" the inner part – john_jerome Jul 07 '20 at 06:43
1

You can match the last occurrence of where using a tempered greedy token and then match any char except [ or ].

Then repeat matching from an opening till closing [ ] and repeat that match 1+ times.

\bwhere(?:(?:(?!where)[^][])*\[[^][]*])+\s*
  • \bwhere Word boundary and match where
  • (?: Non capture group
    • (?: Non capture group
      • (?!where)[^][] Match any char except [ or ] if on the right is not where
    • )* Close group and repeat 0+ times
    • \[[^][]*] Match 0+ times any char except [ or ]
  • )+ Close group and repeat 1+ times to match at least one time [...]
  • \s* Match 0+ whitespace chars

Regex demo | Python demo

Example code

import re
 
regex = r"\bwhere(?:(?:(?!where)[^][])*\[[^][]*])+\s*"
txt = ("where this condition and [this = 1] group by 1,2,3\n"
    "where that and that do that where [this = 1] and [that = 1] group by 1,2,3")
result = re.sub(regex, "", txt)
 
print (result)

Output

group by 1,2,3
where that and that do that group by 1,2,3
The fourth bird
  • 154,723
  • 16
  • 55
  • 70