75

I need to split a string like this, on semicolons. But I don't want to split on semicolons that are inside of a string (' or "). I'm not parsing a file; just a simple string with no line breaks.

part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5

Result should be:

  • part 1
  • "this is ; part 2;"
  • 'this is ; part 3'
  • part 4
  • this "is ; part" 5

I suppose this can be done with a regex but if not; I'm open to another approach.

Sylvain
  • Do you have more examples? or there are more kind of "parts"? – msemelman May 07 '10 at 03:32
  • I don't think so. I want to split on semicolons and ignore semicolons inside quotes. I'd consider any solution that does not do *exactly* that to be invalid. Can you think of other cases that could break the solutions provided so far? – Sylvain May 07 '10 at 03:43
  • Can quotes appear escaped inside strings? e.g. `"this is a \"quoted\" string"`? If so then a regex solution is going to be fiendishly difficult or even impossible. – Dave Kirby May 07 '10 at 06:30
  • No; I don't have to support that case. – Sylvain May 08 '10 at 11:42
  • The second line of the example output is missing a semicolon. It's correct below in the answers. Should be: `"this is ; part 2;"` – Harvey Jan 06 '11 at 03:17

17 Answers

55

Most of the answers seem massively overcomplicated. You don't need backreferences. You don't need to depend on whether or not re.findall gives overlapping matches. Since the input cannot be parsed with the csv module, a regular expression is pretty much the only way to go, and all you need is to call re.split with a pattern that matches a field.

Note that it is much easier here to match a field than it is to match a separator:

import re
data = """part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5"""
PATTERN = re.compile(r'''((?:[^;"']|"[^"]*"|'[^']*')+)''')
print(PATTERN.split(data)[1::2])

and the output is:

['part 1', '"this is ; part 2;"', "'this is ; part 3'", 'part 4', 'this "is ; part" 5']
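
To see why the `[1::2]` slice is there: because the pattern is wrapped in a capturing group, re.split returns the captured fields interleaved with whatever lies between them. A small sketch reusing the same data and PATTERN (the names pieces, separators and fields are only for illustration):

```python
import re

data = """part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5"""
PATTERN = re.compile(r'''((?:[^;"']|"[^"]*"|'[^']*')+)''')

# With a capturing group, re.split keeps the matched text in the result list.
pieces = PATTERN.split(data)
separators = pieces[0::2]  # text between the field matches (semicolons, empty ends)
fields = pieces[1::2]      # the captured fields themselves

print(separators)
print(fields)
```

The even positions hold the leftover separators, so slicing from index 1 with a step of 2 keeps only the fields.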

As Jean-Luc Nacif Coelho correctly points out, this won't handle empty fields correctly. Depending on the situation, that may or may not matter. If it does matter, it may be possible to handle it by, for example, replacing ';;' with ';<marker>;' before the split, where <marker> is some string (without semicolons) that you know does not appear in the data. You also need to restore the data afterwards:

>>> marker = ";!$%^&;"
>>> [r.replace(marker[1:-1],'') for r in PATTERN.split("aaa;;aaa;'b;;b'".replace(';;', marker))[1::2]]
['aaa', '', 'aaa', "'b;;b'"]

However this is a kludge. Any better suggestions?

Duncan
  • oh btw, `[^;"']+` would be better than `([^;"']...)+` I think – YOU May 07 '10 at 11:10
  • I don't think that `[^;"']+` helps. You still need the + outside the group to handle something that is a mix of ordinary characters and quoted elements. Elements which can repeat and themselves contain repeats are a great way to kill regular expression matching so should be avoided when possible. – Duncan May 07 '10 at 14:42
  • 2
    Thanks a lot -- I encountered same problem but with whitespace, so I just substituted the semicolon for a space and it worked perfectly. – ds1848 Dec 14 '13 at 00:21
  • This does not match `aaa;;aaa` – Jean-Luc Nacif Coelho Jun 27 '15 at 22:22
  • @Jean-LucNacifCoelho, yes it does: `>>> PATTERN.split("aaa;;aaa")[1::2]` output is `['aaa', 'aaa']` – Duncan Jun 28 '15 at 19:18
  • 1
    But shouldn't it output `['aaa', '', 'aaa']`? – Jean-Luc Nacif Coelho Jun 29 '15 at 22:48
  • @Jean-LucNacifCoelho, oh yes. I see what you mean. – Duncan Jun 30 '15 at 08:11
  • doesn't work if there is an apostrophe in the data because it treats it as a single quote – jbchurchill May 21 '16 at 01:47
  • There is still an issue with cases with trailing odd numbered semicolons : aaa;;aaa;;; – Edmond Lafay-David Jun 07 '16 at 17:54
  • `PATTERN = re.compile(r'''((?:[^;"']|"[^"]*"|'[^']*')+|(?=;;)|(?=;$)|(?=^;))''')` will catch all the empty groups, including at the beginning and at the end. It can be further used as `PATTERN.findall(data)` – Roman Mar 08 '17 at 13:03
  • use csv.reader: https://stackoverflow.com/questions/43067373/split-by-comma-and-how-to-exclude-comma-from-quotes-in-split-python – Sheng Bi Feb 25 '19 at 01:47
  • @ShengBi if you think the original question can be answered using csv reader then feel free to add your own answer, but it looks like this isn't a case you can do easily using csv.reader. – Duncan Feb 25 '19 at 10:15
43
re.split(''';(?=(?:[^'"]|'[^']*'|"[^"]*")*$)''', data)

Each time it finds a semicolon, the lookahead scans the entire remaining string, making sure there's an even number of single-quotes and an even number of double-quotes. (Single-quotes inside double-quoted fields, or vice-versa, are ignored.) If the lookahead succeeds, the semicolon is a delimiter.

Unlike Duncan's solution, which matches the fields rather than the delimiters, this one has no problem with empty fields. (Not even the last one: unlike many other split implementations, Python's does not automatically discard trailing empty fields.)
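
A quick way to check the empty-field behaviour is to wrap the call in a small helper (the name split_on_unquoted_semicolons is made up for this sketch) and feed it a string with consecutive and trailing semicolons:

```python
import re

# A semicolon counts as a delimiter only if the rest of the string
# consists of unquoted characters and complete quoted runs.
DELIMITER = re.compile(r''';(?=(?:[^'"]|'[^']*'|"[^"]*")*$)''')

def split_on_unquoted_semicolons(text):
    """Split text on semicolons that are not inside '...' or "..."."""
    return DELIMITER.split(text)

print(split_on_unquoted_semicolons("aaa;;aaa;'b;;b';"))
# ['aaa', '', 'aaa', "'b;;b'", '']
```

The trailing empty string is the empty field after the final semicolon. Since the lookahead rescans the rest of the string for every semicolon, this is probably best kept for short inputs.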

Alan Moore
  • Thank you Alan, I almost missed this response. It is similar to Duncan's but it slices the string more elegantly. I had a similar problem and it worked perfectly. – marshall.ward Feb 28 '11 at 13:55
  • For each `;` this solution will run lookahead making sure that quotes are balanced after this semicolon (otherwise this semicolon is quoted and should be omitted). So, the complexity is `O(n^2)` (assuming number of `;` is growing linear with length of the string). – ovgolovin Jun 30 '13 at 12:44
  • Thanks Alan. You saved my day :) – Painkiller Jun 08 '16 at 07:07
  • should have more likes than Duncan's as it can handle the empty string correctly! – juan Isaza Jan 11 '17 at 20:31
  • 2
    Note this doesn't seem to handle escaped quote marks, e.g. `'"scarlett o\'hara"; rhett butler'` - whereas Duncan's solution does. – nrflaw Jan 31 '18 at 18:24
23
>>> a='A,"B,C",D'
>>> a.split(',')
['A', '"B', 'C"', 'D']

That fails. Now try the csv module:
>>> import csv
>>> from StringIO import StringIO
>>> data = StringIO(a)
>>> data
<StringIO.StringIO instance at 0x107eaa368>
>>> reader = csv.reader(data, delimiter=',') 
>>> for row in reader: print row
... 
['A', 'B,C', 'D']
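
For reference, a Python 3 sketch of the same idea, assuming io.StringIO in place of the Python 2 StringIO module (csv_fields is a hypothetical helper name). It also shows what csv does with the mixed quoting from the question:

```python
import csv
from io import StringIO

def csv_fields(text, delimiter=','):
    """Parse one line of delimited text with csv.reader and return its fields."""
    return next(csv.reader(StringIO(text), delimiter=delimiter))

# Works when a field is entirely wrapped in double quotes.
print(csv_fields('A,"B,C",D'))

# The question's input mixes quote styles and quotes mid-field,
# so csv splits inside 'part 3' and 'part 5' and strips the quotes from part 2.
op_line = '''part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5'''
print(csv_fields(op_line, delimiter=';'))
```

So csv is a good fit for conventionally quoted data, but not for the exact input in the question.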
Mohammad Shahid Siddiqui
  • 3
    I scrolled down the page this far to answer the exact same thing. It's a shame this answer is so far down; the csv module is absolutely the right way to go – Edmond Lafay-David Jun 07 '16 at 20:36
  • 4
    In Python3.0 do `from io import StringIO` instead of `StringIO`. From https://docs.python.org/3.0/whatsnew/3.0.html "The StringIO and cStringIO modules are gone. Instead, import the io module and use io.StringIO or io.BytesIO for text and data respectively." – Pritesh Ranjan Aug 11 '19 at 05:36
  • This only works when quotes surround the *entire* "field". It doesn't work for OP's input; it gives `['part 1', 'this is ; part 2;', "'this is ", " part 3'", 'part 4', 'this "is ', ' part" 5']`, incorrectly splitting some parts and removing the enclosing quotes from part 2. Aside from that, Simon Callan already detailed the approach nearly 6 years before this answer, so I have no idea why this one got more credit. – Karl Knechtel Jul 25 '23 at 04:56
11

Here is an annotated pyparsing approach:

from pyparsing import (printables, originalTextFor, OneOrMore, 
    quotedString, Word, delimitedList)

# unquoted words can contain anything but a semicolon
printables_less_semicolon = printables.replace(';','')

# capture content between ';'s, and preserve original text
content = originalTextFor(
    OneOrMore(quotedString | Word(printables_less_semicolon)))

# process the string
test = '''part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5'''
print(delimitedList(content, ';').parseString(test))

giving

['part 1', '"this is ; part 2;"', "'this is ; part 3'", 'part 4', 
 'this "is ; part" 5']

By using pyparsing's provided quotedString, you also get support for escaped quotes.

You were also unclear about how to handle whitespace before or after a semicolon delimiter, and none of the fields in your sample text has any. Pyparsing would parse "a; b ; c" as:

['a', 'b', 'c']
PaulMcG
  • 1
    +1 I was about to post a pyparsing solution but yours is more elegant – Luper Rouch May 07 '10 at 12:56
  • 1
    This answer is tremendously useful. Starting here, I was able to dl, install, and write a simple IMAP header parser in 10 lines. Thanks! – Harvey Jan 06 '11 at 19:01
  • This is great! However, in cases where a value is blank (e.g.:[ ,23,43,38,75,26,19,37,43,19,27,25,20,34,22,23] ) I get pyparsing.ParseException: Expected {quotedString using single or double quotes | W:(0123...)} (at char 0), (line:1, col:1) – chri_chri Jun 20 '18 at 09:48
  • @LuperRouch this answer is by the author of pyparsing, so I should hope it's elegant ;) – Karl Knechtel Jul 25 '23 at 05:03
9

You appear to have a semicolon-separated string. Why not use the csv module to do all the hard work?

Off the top of my head, this should work

import csv
from io import StringIO

line = '''part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5'''

data = StringIO(line)
reader = csv.reader(data, delimiter=';')
for row in reader:
    print(row)

This should give you something like
("part 1", "this is ; part 2;", 'this is ; part 3', "part 4", "this \"is ; part\" 5")

Edit:
Unfortunately, this doesn't quite work (even if you do use StringIO, as I intended) because of the mixed string quotes (both single and double). What you actually get is

['part 1', 'this is ; part 2;', "'this is ", " part 3'", 'part 4', 'this "is ', ' part" 5'].

If you can change the data to only contain single or double quotes at the appropriate places, it should work fine, but that sort of negates the question a bit.

Simon Callan
  • 1
    +1: csv.reader takes an iterable, so you need to wrap the input string in a list: `csv.reader([data], delimiter=';')`. Apart from that it does exactly what the user wants. This will also handle embedded quotes characters prefixed with a backslash. – Dave Kirby May 07 '10 at 06:35
  • 1
    actually, csv module isn't that smart, doesn't work when I tested. his data has both single quotes and double quotes, and csv module cannot handle `this "is ; part" 5` as single block, which result in `['part 1', 'this is ; part 2;', "'this is ", " part 3'", 'part 4', 'this "is ', ' part" 5']` – YOU May 07 '10 at 06:38
  • 2
    The csv module not only doesn't handle more than one quote type, but it also insists that fields are entirely quoted or not quoted at all. That means part 5 will be split in two because a double quote in the middle of a field is just a literal not quoting the content. I'm afraid in this case the options are (a) use an excessively complex regular expression, or (b) get the format of the input data changed to use some recognisable variant of CSV. If it was me I'd go for option (b). – Duncan May 07 '10 at 07:48
4

While it could be done with PCRE via lookaheads/lookbehinds/backreferences, it's not really a task that regex is designed for, due to the need to match balanced pairs of quotes.

Instead it's probably best to just make a mini state machine and parse through the string like that.

Edit

As it turns out, due to the handy additional feature of Python re.findall which guarantees non-overlapping matches, this can be more straightforward to do with a regex in Python than it might otherwise be. See comments for details.

However, if you're curious about what a non-regex implementation might look like:

x = """part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5"""

results = [[]]
quote = None  # the quote character we are currently inside, if any
for c in x:
    if c == "'" or c == '"':
        if c == quote:
            quote = None
        elif quote is None:
            quote = c
    elif c == ';':
        if quote is None:
            results.append([])
            continue
    results[-1].append(c)

results = [''.join(chars) for chars in results]

# results = ['part 1', '"this is ; part 2;"', "'this is ; part 3'",
#            'part 4', 'this "is ; part" 5']
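
If you want this as a reusable helper, the same loop can be wrapped in a function. This is only a sketch: the name split_outside_quotes and the sep parameter are my own, and it assumes quotes are never escaped.

```python
def split_outside_quotes(text, sep=';'):
    """Split text on sep, ignoring any sep found inside '...' or "..."."""
    fields = []
    current = []
    quote = None  # the quote character currently open, if any
    for ch in text:
        if ch in ('"', "'"):
            if quote is None:
                quote = ch       # opening quote
            elif ch == quote:
                quote = None     # matching closing quote
        elif ch == sep and quote is None:
            fields.append(''.join(current))
            current = []
            continue             # do not keep the separator itself
        current.append(ch)
    fields.append(''.join(current))  # the last field has no trailing separator
    return fields

print(split_outside_quotes("aaa;;aaa;'b;;b';"))
# ['aaa', '', 'aaa', "'b;;b'", '']
```

Because every separator outside quotes starts a new field, empty fields (including a trailing one) come out as empty strings, and the whole thing is a single pass over the input.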
Amber
  • 1
    The question does not require balancing at all - just enclosing and single-character escaping. It's a pretty straightforward (and actually formally regular) pattern. – Max Shawabkeh May 07 '10 at 02:38
  • Actually, the only reason `findall` works is due to the additional restriction implemented in Python that the returned matches be *non-overlapping*. Otherwise, a string like `'''part 1;"this 'is' sparta";part 2'''` would fail due to the pattern also matching a subset of the string. – Amber May 07 '10 at 02:45
  • I'm using `findall` because we need to extract the string. Formally, regular expressions only do matching. To match, we can simply use `^mypattern(;mypattern)*$`. – Max Shawabkeh May 07 '10 at 02:48
  • However, doing so gives up, as you point out, the ability to extract the text in a nice manner (though I suppose you could iterate through an indefinite number of captures). – Amber May 07 '10 at 02:51
  • Oh, yours is much nicer than mine. :) – Ipsquiggle May 07 '10 at 03:12
4
>>> x = '''part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5'''
>>> import re
>>> re.findall(r'''(?:[^;'"]+|'(?:[^'\\]|\\.)*'|"(?:[^"\\]|\\.)*")+''', x)
['part 1', '"this is ; part 2;"', "'this is ; part 3'", 'part 4', 'this "is ; part" 5']
Max Shawabkeh
  • Fails on the following string: `'''part 1;"this is ';' part 2;";'this is ; part 3';part 4'''` – Amber May 07 '10 at 02:33
  • Right. Fixed. Forgot to swap the single/double quotes in the second part. – Max Shawabkeh May 07 '10 at 02:34
  • I'm sorry, I missed something in my test case. See part 5 in my question. Thanks – Sylvain May 07 '10 at 02:51
  • Your 5th test case is probably going to render this solution much less viable. – Amber May 07 '10 at 02:55
  • Ok, I really just want to ignore semicolons inside quotes. I don't want quotes to act as separators. – Sylvain May 07 '10 at 02:56
  • It seems this is the approach Amber was alluding to at the time. Using `findall` is elegant, but I think the regex itself can be simplified; and this is seriously lacking in explanation. – Karl Knechtel Jul 25 '23 at 05:02
2

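Since your string contains no '\n', you can replace every ';' that is not inside a quoted string with '\n' and then split on '\n':

>>> s = '''part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5'''
>>> new_s = ''
>>> is_open = False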

>>> for c in s:
...     if c == ';' and not is_open:
...         c = '\n'
...     elif c in ('"',"'"):
...         is_open = not is_open
...     new_s += c

>>> result = new_s.split('\n')

>>> result
['part 1', '"this is ; part 2;"', "'this is ; part 3'", 'part 4', 'this "is ; part" 5']
remosu
  • Clean and simple. Since it's just a simple string, no need to worry about efficiency. To handle nested quotes, may need to tweak the elif statement. – Dingle May 07 '10 at 20:52
2

We can write a function of our own (this one splits on commas and only recognises double quotes):

def split_with_commas_outside_of_quotes(string):
    arr = []
    start, in_quotes = 0, False
    for pos, x in enumerate(string):
        if x == '"':
            in_quotes = not in_quotes
        if not in_quotes and x == ',':
            arr.append(string[start:pos])
            start = pos + 1
    arr.append(string[start:])  # the last field runs to the end of the string
    return arr
Pradeep Pathak
1

This regex will do that: (?:^|;)("(?:[^"]+|"")*"|[^;]*)
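
A minimal sketch of how this pattern could be used with re.findall, which returns the contents of the single capturing group for each match. The names FIELD and fields are illustrative, and as written the pattern only knows about double quotes:

```python
import re

# Each match starts at the beginning of the string or at a semicolon,
# then captures either a double-quoted field or a run of non-semicolons.
FIELD = re.compile(r'(?:^|;)("(?:[^"]+|"")*"|[^;]*)')

def fields(text):
    """Return the captured field of every match."""
    return FIELD.findall(text)

print(fields('a;"b;c";;d'))   # ['a', '"b;c"', '', 'd']
print(fields("'x;y'"))        # single quotes are not protected: ["'x", "y'"]
```

The second call illustrates the point made in the comments: single-quoted fields would need an extra alternative in the pattern.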

dawg
  • You'll want to add another option for single quotes as well. – Amber May 07 '10 at 02:25
  • Which will then break, unless you can use backreferences in python's `re` module (which don't appear documented). The second you support both types of quotes, you could potentially match this `"quoted'` vs `"quoted' single quote"` – dlamotte May 07 '10 at 02:28
  • 1
    Also see http://stackoverflow.com/questions/133601/can-regular-expressions-be-used-to-match-nested-patterns – killdash10 May 07 '10 at 02:29
  • @xyld: Python's `re` module does support backreferences. @killdash10: That's irrelevant. The OP is not trying to parse nested patterns. – Max Shawabkeh May 07 '10 at 02:31
  • @killdash10 exactly, but with backreferences in perl, you can do it ;) Breaks the whole pumping lemma, DFA/NFA thing because the regular expression has state, very small/limited state, but state none-the-less – dlamotte May 07 '10 at 02:32
  • That won't work if you have escaped quotes inside a string. Think `"s\"r\\\"g\\\"\""`. I think regex is the wrong approach here because regular expressions can't count and can't recurse. Regular expressions can't jump, if you will. – wilhelmtell May 07 '10 at 02:32
  • @max they didn't look documented? Can you post a link? – dlamotte May 07 '10 at 02:32
  • Also: fails on the following string: `'''part 1;"this is ';' part 2;";'this is "part" 3';part 4'''` – Amber May 07 '10 at 02:33
  • @xyld: See the explanation of `(...)` here: http://docs.python.org/library/re.html#regular-expression-syntax – Max Shawabkeh May 07 '10 at 02:36
  • Well sure enough, then its possible with a `re.findall()`, but definitely not **one** regex search across the string... You can search it multiple times with one regex and do it. I dont know of a great way to do this any other way in python and be efficient? – dlamotte May 07 '10 at 02:38
1

Although the topic is old and the previous answers work well, I propose my own implementation of the split function in Python.

This works fine if you don't need to process a large number of strings, and it is easily customizable.

Here's my function:

# l is the string to parse
# splitchar is the separator
# ignorechars contains the quote characters between which you don't want to split

def splitstring(l, splitchar, ignorechars):
    result = []
    string = ""
    quote = None  # the quote character currently open, if any
    for c in l:
        if c == splitchar and quote is None:
            result.append(string)
            string = ""
            continue
        if quote is None and c in ignorechars:
            quote = c
        elif c == quote:
            quote = None
        string += c
    result.append(string)  # don't lose the last field
    return result

So you can run:

line = """part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5"""
splitted_data = splitstring(line, ';', '"\'')

result:

['part 1', '"this is ; part 2;"', "'this is ; part 3'", 'part 4', 'this "is ; part" 5']

The advantage is that this function works with empty fields and with any number of separators in the string.

Hope this helps!

0

Even though I'm certain there is a clean regex solution (so far I like @noiflection's answer), here is a quick-and-dirty non-regex answer.

s = """part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5"""

inQuotes = False
current = ""
results = []
currentQuote = ""
for c in s:
    if not inQuotes and c == ";":
        results.append(current)
        current = ""
    elif not inQuotes and (c == '"' or c == "'"):
        currentQuote = c
        inQuotes = True
    elif inQuotes and c == currentQuote:
        currentQuote = ""
        inQuotes = False
    else:
        current += c

results.append(current)

print(results)
# ['part 1', 'this is ; part 2;', 'this is ; part 3', 'part 4', 'this is ; part 5']

(I've never put together something of this sort, feel free to critique my form!)

Ipsquiggle
0

My approach is to replace all non-quoted occurrences of the semi-colon with another character which will never appear in the text, then split on that character. The following code uses the re.sub function with a function argument to search and replace all occurrences of a srch string, not enclosed in single or double quotes or parens, brackets or braces, with a repl string:

import re

def srchrepl(srch, repl, string):
    """
    Replace non-bracketed/quoted occurrences of srch with repl in string.
    """
    resrchrepl = re.compile(r"""(?P<lbrkt>[([{])|(?P<quote>['"])|(?P<sep>["""
                          + srch + r"""])|(?P<rbrkt>[)\]}])""")
    return resrchrepl.sub(_subfact(repl), string)


def _subfact(repl):
    """
    Replacement function factory for regex sub method in srchrepl.
    """
    level = 0
    qtflags = 0
    def subf(mo):
        nonlocal level, qtflags
        sepfound = mo.group('sep')
        if  sepfound:
            if level == 0 and qtflags == 0:
                return repl
            else:
                return mo.group(0)
        elif mo.group('lbrkt'):
            if qtflags == 0:
                level += 1
            return mo.group(0)
        elif mo.group('quote') == "'":
            qtflags ^= 1            # toggle bit 1
            return "'"
        elif mo.group('quote') == '"':
            qtflags ^= 2            # toggle bit 2
            return '"'
        elif mo.group('rbrkt'):
            if qtflags == 0:
                level -= 1
            return mo.group(0)
    return subf

If you don't care about the bracketed characters, you can simplify this code a lot.
Say you wanted to use a pipe or vertical bar as the substitute character, you would do:

mylist = srchrepl(';', '|', mytext).split('|')

BTW, this uses nonlocal, which was introduced in Python 3.0; on older versions you would have to keep level and qtflags in module-level globals instead.
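
As a rough illustration of that simplification, here is a sketch that drops the bracket tracking entirely. Note that it is a different technique from the flag-toggling above: quoted runs are matched whole and passed through unchanged, and only separators outside them are replaced. The name replace_unquoted is hypothetical.

```python
import re

def replace_unquoted(sep, repl, text):
    """Replace occurrences of sep that are outside '...' and "..." with repl."""
    # Quoted runs are tried first, so a separator inside them is never seen on its own.
    pattern = re.compile(r'''"[^"]*"|'[^']*'|(?P<sep>''' + re.escape(sep) + ')')

    def substitute(mo):
        # Only the named group marks a real separator; quoted text is kept as-is.
        return repl if mo.group('sep') else mo.group(0)

    return pattern.sub(substitute, text)

mytext = '''part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5'''
mylist = replace_unquoted(';', '|', mytext).split('|')
print(mylist)
```

As with the original, the substitute character must be one that never appears in the text.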

Don O'Donnell
0

A generalized solution:

import re
regex = '''(?:(?:[^{0}"']|"[^"]*(?:"|$)|'[^']*(?:'|$))+|(?={0}{0})|(?={0}$)|(?=^{0}))'''

delimiter = ';'
data2 = ''';field 1;"field 2";;'field;4';;;field';'7;'''
field = re.compile(regex.format(delimiter))
print(field.findall(data2))

Outputs:

['', 'field 1', '"field 2"', '', "'field;4'", '', '', "field';'7", '']

This solution:

  • captures all the empty groups (including at the beginning and at the end)
  • works for most popular delimiters including space, tab, and comma
  • treats quotes inside quotes of the other type as non-special characters
  • if an unmatched unquoted quote is encountered, treats the remainders of the line as quoted
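
One way to package this, sketched under the assumption that running the delimiter through re.escape is enough to make regex metacharacters such as '|' safe (split_fields and FIELD_TEMPLATE are illustrative names):

```python
import re

# Same pattern as above, with {0} standing in for the (escaped) delimiter.
FIELD_TEMPLATE = r'''(?:(?:[^{0}"']|"[^"]*(?:"|$)|'[^']*(?:'|$))+|(?={0}{0})|(?={0}$)|(?=^{0}))'''

def split_fields(text, delimiter=';'):
    """Return the fields of text, treating quoted delimiters as ordinary characters."""
    field = re.compile(FIELD_TEMPLATE.format(re.escape(delimiter)))
    return field.findall(text)

print(split_fields(''';field 1;"field 2";;'field;4';;;field';'7;'''))
print(split_fields("a,,'b,c',", delimiter=','))
print(split_fields("a|b", delimiter='|'))
```

As far as I can tell, an input consisting of a single delimiter yields one empty field rather than two, so treat this as a sketch rather than a complete splitter.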
Roman
0

Instead of splitting on a separator pattern, just capture whatever you need:

>>> import re
>>> data = '''part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5'''
>>> re.findall(r';([\'"][^\'"]+[\'"]|[^;]+)', ';' + data)
['part 1', '"this is ; part 2;"', "'this is ; part 3'", 'part 4', 'this "is ', ' part" 5']
Michael Spector
0

The simplest option is to use shlex (simple lexical analysis), a built-in module in Python:

import shlex
shlex.split("""part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5 """ )

['part',
 '1;this is ; part 2;;this is ; part 3;part',
 '4;this',
 'is ; part',
 '5']
Solomon Vimal
-1

This seemed to me a semi-elegant solution.

New Solution:

import re
reg = re.compile('(\'|").*?\\1')
pp = re.compile('.*?;')
def splitter(string):
    #add a last semicolon
    string += ';'
    replaces = []
    s = string
    i = 1
    #replace the content of each quote for a code
    for quote in reg.finditer(string):
        out = string[quote.start():quote.end()]
        s = s.replace(out, '**' + str(i) + '**')
        replaces.append(out)
        i+=1
    #split the string without quotes
    res = pp.findall(s)

    #add the quotes again
    #TODO this part could be faster.
    #(linear instead of quadratic)
    i = 1
    for replace in replaces:
        for x in range(len(res)):
            res[x] = res[x].replace('**' + str(i) + '**', replace)
        i+=1
    return res

Old solution:

I chose to check whether there is an opening quote, wait for it to close, and then match an ending semicolon. Each "part" you want to match needs to end in a semicolon, so this matches things like:

  • 'foobar;.sska';
  • "akjshd;asjkdhkj..,";
  • asdkjhakjhajsd.jhdf;

Code:

mm = re.compile('''((?P<quote>'|")?.*?(?(quote)\\2|);)''')
res = mm.findall('''part 1;"this is ; part 2;";'this is ; part 3';part 4''')

you may have to do some postprocessing to res, but it contains what you want.

msemelman