4

I'd like to split the following string by the word 'and', except when 'and' is within quotes:

string = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"

Desired Result

["section_category_name = 'computer and equipment expense'","date >= 2015-01-01","date <= 2015-03-31"]

I can't seem to find the correct regex pattern that splits the string correctly so that 'computer and equipment expense' is not split.

Here's what I tried:

re.split('and',string)

Result

[" section_category_name = 'computer "," equipment expense' ",' date >= 2015-01-01 ',' date <= 2015-03-31']

As you can see the result has split 'computer and equipment expense' into different items on the list.

I've also tried the following from this question:

r = re.compile(r'(?! )[^[]+?(?= *\[)'
               r'|'
               r'\[.+?\]')
r.findall(string)

Result:

[]

I've also tried the following from this question

result = re.split(r"and+(?=[^()]*(?:\(|$))", string)

Result:

[" section_category_name = 'computer ",
 " equipment expense' ",
 ' date >= 2015-01-01 ',
 ' date <= 2015-03-31']

The challenge is that the prior questions on this topic do not address how to split a string by a word while ignoring occurrences of that word inside quotes; they address how to split a string by a special character or a space.

I was able to get the desired result if I modified the string to the following

string = " section_category_name = (computer and equipment expense) and date >= 2015-01-01 and date <= 2015-03-31"
result = re.split(r"and+(?=[^()]*(?:\(|$))", string)

Desired Result

[' section_category_name = (computer and equipment expense) ',
 ' date >= 2015-01-01 ',
 ' date <= 2015-03-31']

However, I need the function to skip 'and' inside apostrophes rather than inside parentheses.

Chris
  • I've tried all of the above solutions and tried to alter them to be able to split on the word 'and' with little luck. I'll continue to post everything I tried above – Chris Dec 23 '15 at 22:01
  • Short form: Regular expressions are a poor tool for the job at hand. This is one of those places where one should really build a real parser. – Charles Duffy Dec 23 '15 at 22:46
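
To make the parser suggestion in the comment above concrete, here is a minimal sketch of a character-level scanner. It assumes that single quotes delimit literals, that quotes are never escaped, and that the separator is the lowercase word and with whitespace (or the string boundary) on both sides. The name split_outside_quotes is hypothetical, not something from the question.

```python
def split_outside_quotes(text, sep="and"):
    """Split text on the whole word `sep`, ignoring occurrences inside '...'.

    Assumptions of this sketch: no escaped quotes, and `sep` counts only
    when it is surrounded by whitespace or the string boundaries.
    """
    parts = []
    current = []
    in_quotes = False
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "'":
            # Toggle quote state; the quote itself stays in the output.
            in_quotes = not in_quotes
            current.append(ch)
            i += 1
            continue
        if not in_quotes and text.startswith(sep, i):
            end = i + len(sep)
            before_ok = i == 0 or text[i - 1].isspace()
            after_ok = end == len(text) or text[end].isspace()
            if before_ok and after_ok:
                # Whole-word separator outside quotes: close the current part.
                parts.append("".join(current).strip())
                current = []
                i = end
                continue
        current.append(ch)
        i += 1
    parts.append("".join(current).strip())
    return parts


string = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"
print(split_outside_quotes(string))
```

Because it checks the characters on both sides of the separator, a word such as "band" is left alone. Supporting escaped quotes or case-insensitive separators would need extra state, which is where a real tokenizer starts to pay off.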

6 Answers

1

You can use the following regex with re.findall:

((?:(?!\band\b)[^'])*(?:'[^'\\]*(?:\\.[^'\\]*)*'(?:(?!\band\b)[^'])*)*)(?:and|$)

See the regex demo.

The regular expression is an unrolled sequence of two parts: any characters other than ' up to the first whole-word and (matched by the tempered greedy token (?:(?!\band\b)[^'])*), and anything between and including single apostrophes, with escaped characters supported (matched by '[^'\\]*(?:\\.[^'\\]*)*', which is the unrolled form of '(?:[^'\\]|\\.)*').

Python code demo:

import re
p = re.compile(r'((?:(?!\band\b)[^\'])*(?:\'[^\'\\]*(?:\\.[^\'\\]*)*\'(?:(?!\band\b)[^\'])*)*)(?:and|$)')
s = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"
print([x for x in p.findall(s) if x])
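
For readers who want to follow the pattern piece by piece, here is the same regex laid out with re.VERBOSE and annotated. This is only a sketch: the strip() step at the end is an addition, on the assumption that the surrounding whitespace is not wanted, and the names pattern and chunks are illustrative.

```python
import re

pattern = re.compile(r"""
    (                               # group 1: one chunk between separators
      (?:(?!\band\b)[^'])*          #   unquoted text, stopping before a whole-word and
      (?:
        '[^'\\]*(?:\\.[^'\\]*)*'    #   a quoted literal, backslash escapes allowed
        (?:(?!\band\b)[^'])*        #   more unquoted text after the literal
      )*
    )
    (?:and|$)                       # the separator itself, or the end of the string
""", re.VERBOSE)

s = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"

# findall returns group 1 for every match, including an empty match at the
# very end of the string, so empty chunks are filtered out.
chunks = [m.strip() for m in pattern.findall(s) if m.strip()]
print(chunks)
# ["section_category_name = 'computer and equipment expense'", 'date >= 2015-01-01', 'date <= 2015-03-31']
```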
Wiktor Stribiżew
1

You can use re.findall to generate a list of 2-tuples where the first element is either a quoted string or empty, and the second element is either a run of non-whitespace characters or empty.

You can then use itertools.groupby to partition by the word "and" (when not in a quoted string), then rejoin the populated elements inside a list comprehension, e.g.:

import re
from itertools import groupby

text = "section_category_name = 'computer and equipment expense'      and date >= 2015-01-01 and date <= 2015-03-31 and blah = 'ooops'"
items = [
    ' '.join(el[0] or el[1] for el in g)
    for k, g in groupby(re.findall(r"('.*?')|(\S+)", text), lambda L: L[1] == 'and')
    if not k
]

Gives you:

["section_category_name = 'computer and equipment expense'",
 'date >= 2015-01-01',
 'date <= 2015-03-31',
 "blah = 'ooops'"]

Note that whitespace outside the quoted strings is also normalised, which may or may not be desirable.

Also note that this allows some flexibility in grouping: you could change lambda L: L[1] == 'and' to lambda L: L[1] in ('and', 'or') to split on different words if need be.
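
As a rough sketch of that flexibility, the approach can be wrapped in a function that takes the separator words as a parameter. The name split_on_keywords and its keywords argument are made up for illustration; the tokenising regex is the one used above.

```python
import re
from itertools import groupby

# Group 1: a quoted string; group 2: a run of non-whitespace characters.
TOKEN = re.compile(r"('.*?')|(\S+)")


def split_on_keywords(text, keywords=("and",)):
    """Split text on any bare word in `keywords`, leaving quoted strings intact."""
    keywords = frozenset(keywords)
    tokens = TOKEN.findall(text)
    return [
        " ".join(quoted or bare for quoted, bare in group)
        # A quoted token has an empty second element, so it never counts as a separator.
        for is_sep, group in groupby(tokens, key=lambda pair: pair[1] in keywords)
        if not is_sep
    ]


print(split_on_keywords("a = 'x and y' and b = 1 or c = 'p or q'", ("and", "or")))
# ["a = 'x and y'", 'b = 1', "c = 'p or q'"]
```

As before, whitespace between tokens is collapsed to single spaces, and consecutive separator words are treated as one split point.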

Jon Clements
0

If all your strings follow the same pattern, you can use a regex to divide the string into 3 groups. The first group runs from the beginning to the last '. The second group is everything between the first and last "and" outside the quotes. The last group is the rest of the text.

import re

string = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"

output = re.match(r"(^.+['].+['])\sand\s(.+)\sand\s(.+)", string).groups()
print(output)

Each group is defined inside the parentheses in the regex. The square brackets define a character class, which here contains only the single quote. This example works only as long as "section_category_name" equals something inside single quotation marks, followed by exactly two more conditions:

section_category_name = 'something here' and ...
oystein-hr
0

The following code will work, and doesn't need crazy regex to make it happen.

import re

# We create a "lexer" using regex. This will match strings surrounded by single quotes,
# words without any whitespace in them, and the end of the string. We then use finditer()
# to grab all non-overlapping tokens.
lexer = re.compile(r"'[^']*'|[^ ]+|$")

string = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"

results = []
buff = []

# Iterate through all the tokens our lexer identified and parse accordingly
for match in lexer.finditer(string):
    token = match.group(0) # group 0 is the entire matching string

    if token in ('and', ''):
        # Once we reach 'and' or the end of the string '' (matched by $)
        # We join all previous tokens with a space and add to our results.
        results.append(' '.join(buff))
        buff = [] # Reset for the next set of tokens
    else:
        buff.append(token)

print(results)

Demo

Edit: Here's a more concise version, which effectively replaces the for loop in the code above with itertools.groupby.

import re
from itertools import groupby

string = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"

lexer = re.compile(r"'[^']*'|[^\s']+")
grouping = groupby(lexer.findall(string), lambda x: x == 'and')
results = [ ' '.join(g) for k, g in grouping if not k ]

print(results)

Demo

Aldehir
0

I would just use the fact that re.split has this feature:

If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.

Combined with a single capturing group around the quoted-string alternative, this returns a list in which None marks each place where 'and' was split out. This keeps the regex simple, although the pieces need merging afterwards.

>>> tokens = re.split(r"('[^']*')|and", string)
# ['section_category_name = ', "'computer and equipment expense'", ' ', None, ' date >= 2015-01-01 ', None, ' date <= 2015-03-31']    
>>> ''.join([t if t else '\0' for t in tokens]).split('\0')
["section_category_name = 'computer and equipment expense' ", ' date >= 2015-01-01 ', ' date <= 2015-03-31']

Note that the 0x00 character is used there as a temporary separator, so this will not work for strings that already contain null characters.
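
If null characters in the input are a concern, the merging step can be done with a plain loop instead of a sentinel. The following is a sketch only: split_keep_quotes is a hypothetical name, and it swaps in \band\b and a strip() on each piece, neither of which is in the snippet above.

```python
import re


def split_keep_quotes(text):
    """Split on the word and outside single quotes, without a sentinel character."""
    # The quoted alternative is captured, so quoted strings come back as list items.
    # When the second alternative matches, the group did not participate and
    # re.split puts None in the list at that position.
    pieces = re.split(r"('[^']*')|\band\b", text)
    results, current = [], []
    for piece in pieces:
        if piece is None:
            # A real separator: close the current part and start a new one.
            results.append("".join(current).strip())
            current = []
        else:
            current.append(piece)
    results.append("".join(current).strip())
    return results


string = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"
print(split_keep_quotes(string))
# ["section_category_name = 'computer and equipment expense'", 'date >= 2015-01-01', 'date <= 2015-03-31']
```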

scope
0

I'm not sure what you want to do about whitespace surrounding and, and what you want to do about repeated ands in the string. What would you want if your string was 'hello and and bye', or 'hello andand bye'?

I haven't tested all the corner cases, and I strip whitespace around 'and', which may or may not be what you want:

string = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"
res = []
spl = 'and'
for idx, sub in enumerate(string.split("'")):
  if idx % 2 == 0:
    subsub = sub.split(spl)
    for jdx in range(1, len(subsub) - 1):
      subsub[jdx] = subsub[jdx].strip()
    if len(subsub) > 1:
      subsub[0] = subsub[0].rstrip()
      subsub[-1] = subsub[-1].lstrip()
    res += [i for i in subsub if i.strip()]
  else:
    quoted_str = "'" + sub + "'"
    if res:
      res[-1] += quoted_str
    else:
      res.append(quoted_str)

An even simpler solution, if you know that 'and' will be surrounded by a space on either side and will not be repeated, and you don't want to remove the extra whitespace:

string = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"
spl = 'and'
res = []
spaced_spl = ' ' + spl + ' '
for idx, sub in enumerate(string.split("'")):
  if idx % 2 == 0:
    res += [i for i in sub.split(spaced_spl) if i.strip()]
  else:
    quoted_str = "'" + sub + "'"
    if res:
      res[-1] += quoted_str
    else:
      res.append(quoted_str)

Output:

["section_category_name = 'computer and equipment expense'", 'date >= 2015-01-01', 'date <= 2015-03-31']
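
On the corner cases raised at the top of this answer, here is a sketch of how a word-boundary regex would treat them. This is a different technique from the loops above: it splits on and as a whole word only when an even number of single quotes follows it, which assumes the quotes are balanced and never escaped. SEP and split_and are illustrative names.

```python
import re

# and as a whole word, followed by an even number of single quotes
# up to the end of the string (i.e. we are not inside a quoted literal).
SEP = re.compile(r"\band\b(?=(?:[^']*'[^']*')*[^']*$)")


def split_and(text):
    return [part.strip() for part in SEP.split(text)]


print(split_and("hello and and bye"))   # ['hello', '', 'bye']
print(split_and("hello andand bye"))    # ['hello andand bye']
print(split_and("section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"))
```

A repeated and yields an empty item that the caller can filter out, while andand is not treated as a separator at all. The lookahead rescans the rest of the string at every candidate, so this is better suited to short inputs.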
texasflood