
I am new to regex and am trying to split a string using and/or as delimiters.

I used the solution provided in : https://stackoverflow.com/a/18893443/5164936

and modified my regex to:

re.split(r'(\s+and\s+|\s+or\s+)(?=(?:[^"]*"[^"]*")*[^"]*$)', s)

which works like a charm for the majority of my use cases, except for the following input:

'col1 == "val1" or col2 == \'val1 and " val2\''

The split fails for this particular case, and I have tried modifying the above regex in different ways with no luck. Can someone please help me fix this regex?
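A minimal reproduction of the failure, assuming the pattern is used exactly as shown above (the names PATTERN, ok and bad are only for illustration). The lookahead accepts a delimiter only when an even number of double quotes follows it, so the lone " inside the single-quoted value blocks every split:

```python
import re

# The pattern from the question: split on and/or only when an even
# number of double quotes follows the delimiter.
PATTERN = r'(\s+and\s+|\s+or\s+)(?=(?:[^"]*"[^"]*")*[^"]*$)'

ok = 'col1 == "val1" or col2 == "val1 and val2"'
bad = 'col1 == "val1" or col2 == \'val1 and " val2\''

# Balanced double quotes: " or " is followed by two quotes, so it splits;
# " and " is followed by one quote, so it is left alone.
print(re.split(PATTERN, ok))

# One stray double quote inside single quotes: every delimiter is followed
# by an odd number of double quotes, so the whole string comes back as a
# single element.
print(re.split(PATTERN, bad))
```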

Harsh Bafna
  • Thanks a ton @WiktorStribiżew, this resolves my issue. Would it be possible to break down your regex with some explanation of how it works? – Harsh Bafna Aug 28 '18 at 14:03

1 Answer


You may use a solution based on the PyPI regex module:

import regex

s = 'col1 == "val1" or col2 == \'val1 and " val2\''
res = regex.split(r'''(?V1)(?:"[^"]*"|'[^']*')\K|(\s+(?:and|or)\s+)''', s)
print([x for x in res if x])
# => ['col1 == "val1"', ' or ', 'col2 == \'val1 and " val2\'']

See the Python demo online.

Details

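  • (?V1) - inline flag selecting the regex module's version 1 behaviour, which allows splitting at zero-length matches
  • (?:"[^"]*"|'[^']*')\K - a substring between double or single quotation marks that is discarded from the match value using the \K match reset operator (thus, when this alternative matches, the match is an empty string located right after the closing quote)
  • | - or
  • (\s+(?:and|or)\s+) - Group 1: 1+ whitespaces, then and or or, then 1+ whitespaces again. Since this is a capturing group, the matched delimiter is kept in the resulting list.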
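The idea behind the pattern above, consuming quoted substrings first so that an and/or inside them is never seen as a delimiter, can also be sketched with the built-in re module. This is not the method used in this answer: instead of \K it walks the matches with re.finditer and cuts only where the delimiter group took part in the match. The names TOKEN and split_outside_quotes are made up for illustration, and the sketch assumes the data contains no escaped or unbalanced quotes.

```python
import re

# Quoted strings come first in the alternation, so an and/or inside them is
# consumed as part of the quoted string. Only group 1 marks a real delimiter.
TOKEN = re.compile(r'''"[^"]*"|'[^']*'|(\s+(?:and|or)\s+)''')

def split_outside_quotes(s):
    """Split s on and/or outside quotes, keeping the delimiters."""
    parts = []
    start = 0
    for m in TOKEN.finditer(s):
        if m.group(1) is None:
            continue  # a quoted string was matched: nothing to cut here
        parts.append(s[start:m.start()])  # text before the delimiter
        parts.append(m.group(1))          # the delimiter itself
        start = m.end()
    parts.append(s[start:])  # whatever follows the last delimiter
    return parts

print(split_outside_quotes('col1 == "val1" or col2 == \'val1 and " val2\''))
print(split_outside_quotes("col2 == 'val1 and val2' "))
```

For these two inputs the first call yields three items (left operand, ' or ', right operand) and the second returns the string unsplit, with no None or empty entries to filter out.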
Wiktor Stribiżew
  • I am stuck again with this regex. This regex doesn't work for the input : "col2 == 'val1 and val2' ". It splits the string on the "and" word in between the single quotes. Could you please help? – Harsh Bafna Nov 16 '18 at 07:23
  • @HarshBafna The only correct way I see is with a PCRE pattern like `(?:"[^"]*"|'[^']*')(*SKIP)(*F)|(\s+and\s+|\s+or\s+)` (see [demo](https://regex101.com/r/sz5juc/2)). In Python, you may only do something like what I posted above. Let me know if it works for you. – Wiktor Stribiżew Nov 16 '18 at 09:05
  • Thanks Wiktor, highly appreciated. I will try this out. Also, do you mean that it is not possible with the inbuilt re module? – Harsh Bafna Nov 16 '18 at 09:31
  • I am not too good with such complex expressions. Can you suggest some good material from where I can learn regex? – Harsh Bafna Nov 16 '18 at 09:31
  • @HarshBafna I do not know your level of regex knowledge :) so I can only suggest doing all the lessons at [regexone.com](http://regexone.com/), reading through [regular-expressions.info](http://www.regular-expressions.info), the [regex SO tag description](http://stackoverflow.com/tags/regex/info) (with many other links to great online resources), and the community SO post called [What does the regex mean](http://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean). Also, [rexegg.com](http://rexegg.com) is worth having a look at. – Wiktor Stribiżew Nov 16 '18 at 09:31
  • You can say I am a beginner. Mostly working in big-data domain and never got a good chance to do hands-on with regex. Thanks for those links. I will go through them whenever I get some time. Till then I will keep disturbing you like this :D – Harsh Bafna Nov 16 '18 at 09:33
  • @HarshBafna I will check if it is possible with `re` and will let you know. – Wiktor Stribiżew Nov 16 '18 at 09:34
  • the above regex returns extra None and blank values in the list for basic split. – Harsh Bafna Nov 16 '18 at 09:41
  • `import regex; s = "col1='test'"; regex.split(r'''(?V1)(?:"[^"]*"|'[^']*')\K|(\s+(?:and|or)\s+)''', s)` returns `["col1='test'", None, '']` – Harsh Bafna Nov 16 '18 at 09:41
  • @HarshBafna That is why I have `[x for x in res if x]` in my code in the answer. – Wiktor Stribiżew Nov 16 '18 at 09:47
  • Works for the majority of my test cases, except where the data has extra double quotes, for example "col1='McDonal\"s'". – Harsh Bafna Nov 16 '18 at 10:15
  • @HarshBafna No idea what you mean. Are there any escaped quotes in your data? – Wiktor Stribiżew Nov 16 '18 at 10:19
  • yes. it can be like "col1='McDonad\"s'" , "col1='McDonald's" , 'col1="McDonand"s', 'col1="McDonald\'s"'. – Harsh Bafna Nov 16 '18 at 10:25
  • the earlier regex you provided based on python re, took care of all these scenarios. – Harsh Bafna Nov 16 '18 at 10:25
  • @HarshBafna I can't see where the issue is since the strings you provided are not quite ready to test against. Can you share a fiddle? Also, try [**this code**](https://rextester.com/OGUXV83761). – Wiktor Stribiżew Nov 16 '18 at 10:27
  • @HarshBafna The `'brand = "McDonald\"s"'` string literal defines a `brand = "McDonald"s"` literal string. There are 3 unescaped double quotes. There is no way to know where the split should occur in these situations. – Wiktor Stribiżew Nov 16 '18 at 10:51
  • your earlier solution handles this situation :-). re.split(r'(\s+(?:and|or)\s+)(?=(?:[^"]*"[^"]*")*[^"]*$|(?:[^\']*\'[^\']*\')*[^\']*$)', s) – Harsh Bafna Nov 16 '18 at 10:53
  • @HarshBafna That was still wrong, it did not work correctly. The demo at https://rextester.com/KGHNZ77167 shows what is not working well. OK, it seems a parser would be much better here. Give me some time to fix it. – Wiktor Stribiżew Nov 16 '18 at 10:58
  • I meant the solution using python re. The only case it didn't handle for me was : "collegename != 'A and B'" – Harsh Bafna Nov 16 '18 at 11:01
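The point made in the comments about the `'brand = "McDonald\"s"'` literal can be checked directly. A minimal sketch, with s and raw as illustrative names: in a single-quoted Python literal, `\"` is just a double quote, so the backslash never reaches the regex unless a raw literal is used.

```python
# In a normal literal, \" collapses to a bare double quote.
s = 'brand = "McDonald\"s"'
print(s)             # brand = "McDonald"s"
print(s.count('"'))  # 3

# A raw literal keeps the backslash as part of the data.
raw = r'brand = "McDonald\"s"'
print(raw)           # brand = "McDonald\"s"
print('\\' in s, '\\' in raw)  # False True
```

With three unescaped double quotes in s, no pattern that pairs quotes can tell where the quoted value ends.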