2

I am trying to split a comma delimited string in python. The tricky part for me here is that some of the fields in the data themselves have a comma in them and they are enclosed within quotes (" or '). The resulting split string should also have the quotes around the fields removed. Also, some fields can be empty.

Example:

hey,hello,,"hello,world",'hey,world'

needs to be split into 5 parts like below

['hey', 'hello', '', 'hello,world', 'hey,world']

Any ideas/thoughts/suggestions/help with how to go about solving the above problem in Python would be much appreciated.

Thank You, Vish

Tim Pietzcker
  • 328,213
  • 58
  • 503
  • 561
Vish
  • 827
  • 11
  • 21
  • It would be very helpful if you specified what you want to happen in some cases that your simple example doesn't cover: (1) `'abcd'efgh` (2) `'abcd'"efgh"` (3) `abcd"efgh"` -- do you want it to produce one field from each of those (WITH QUOTES UNSTRIPPED) or an exception or something else? – John Machin Feb 14 '11 at 20:40
  • Also, supposing your input file is produced by querying a customer database with a not-implausible address line like `"Dunromin", 123 O'Brien's Road`, how would that be quoted/escaped in the input file? – John Machin Feb 14 '11 at 23:57

4 Answers4

8

Sounds like you want the CSV module.

SpliFF
  • 38,186
  • 16
  • 91
  • 120
  • 1
    -1 Sounds like that's 10 folk (at time of writing) who didn't read the fine print: TWO quote characters e.g. `"hello,world",'hey,world'` -- the csv module won't do that. – John Machin Feb 13 '11 at 07:01
  • 1
    @John: We may disagree on a lot of things, but I have a feeling we agree that the voting system here sometimes has its, uh, weak points... – Glenn Maynard Feb 13 '11 at 09:35
  • Well, I for one think that Original poster made a mistake of specifying two kinds of quotes (unescaped) in a string. For e.g. the input given in the question can never be represented as string as it is. So, after giving the chance for the OP to correct the input, can csv still help? Ofcourse, OP did not ask about handling these different kinds of strings in a string and all he wanted to quoted comma to be single word. – Senthil Kumaran Feb 13 '11 at 10:12
  • @Senthil Kumaran: Please explain """the input given in the question can never be represented as string as it is""" ... you appear to be saying that the series of ASCII bytes `hey,hello,,"hello,world",'hey,world'` can't be represented as a string, which is a nonsense. What are `"""`, `'''`, `\"` and `\'` for? Why can't that series of bytes be read from a file? – John Machin Feb 13 '11 at 11:16
  • @John I was trying to say that, there were unescaped quotes in the input string. "''" and '""' are valid strings, but """''" should be represented as "\"\"''" or '""\'\''. – Senthil Kumaran Feb 13 '11 at 11:22
5

(Edit: The original answer had trouble with empty fields on the edges due to the way re.findall works, so I refactored it a bit and added tests.)

import re

def parse_fields(text):
    r"""
    >>> list(parse_fields('hey,hello,,"hello,world",\'hey,world\''))
    ['hey', 'hello', '', 'hello,world', 'hey,world']
    >>> list(parse_fields('hey,hello,,"hello,world",\'hey,world\','))
    ['hey', 'hello', '', 'hello,world', 'hey,world', '']
    >>> list(parse_fields(',hey,hello,,"hello,world",\'hey,world\','))
    ['', 'hey', 'hello', '', 'hello,world', 'hey,world', '']
    >>> list(parse_fields(''))
    ['']
    >>> list(parse_fields(','))
    ['', '']
    >>> list(parse_fields('testing,quotes not at "the" beginning \'of\' the,string'))
    ['testing', 'quotes not at "the" beginning \'of\' the', 'string']
    >>> list(parse_fields('testing,"unterminated quotes'))
    ['testing', '"unterminated quotes']
    """
    pos = 0
    exp = re.compile(r"""(['"]?)(.*?)\1(,|$)""")
    while True:
        m = exp.search(text, pos)
        result = m.group(2)
        separator = m.group(3)

        yield result

        if not separator:
            break

        pos = m.end(0)

if __name__ == "__main__":
    import doctest
    doctest.testmod()

(['"]?) matches an optional single- or double-quote.

(.*?) matches the string itself. This is a non-greedy match, to match as much as necessary without eating the whole string. This is assigned to result, and it's what we actually yield as a result.

\1 is a backreference, to match the same single- or double-quote we matched earlier (if any).

(,|$) matches the comma separating each entry, or the end of the line. This is assigned to separator.

If separator is false (eg. empty), that means there's no separator, so we're at the end of the string--we're done. Otherwise, we update the new start position based on where the regex finished (m.end(0)), and continue the loop.

Glenn Maynard
  • 55,829
  • 10
  • 121
  • 131
  • @John: I didn't look into precisely why the extra blank match shows up at the end, but it's probably due to the fact that the regex matches the null string. I just tested to make sure it always happens, to be sure that trimming it off works reliably. – Glenn Maynard Feb 13 '11 at 12:08
  • Hey Glenn, This is great! I am still trying to grok that piece of regex code. It is ingenious. A couple of questions if you don't mind. How do the quotes not end up in the result? You also mentioned that this pattern ('|'"|) matches nothing at the end... How can it match NOTHING? NOTHING has nothing to compare to...my head is spinning thinking about it... – Vish Feb 14 '11 at 16:11
  • There are three quotes you're allowing: a single-quote, a double-quote and the null string. Another way you could write that would be `(['"]?)`. It's the same as any `a?` match: match one or zero quotation marks. – Glenn Maynard Feb 14 '11 at 21:52
  • -1 @Glenn Maynard, @user376263: **Sorry, it doesn't always produce a superfluous "blank match"**. An input of `,,,` or `a,b,c,` should give an answer containing 4 items, the last of which is a 0-length string; however that is removed. – John Machin Feb 14 '11 at 23:41
  • user324958290358: The quotes aren't in the result because of the groupings. The text inside the quotes is in its own (matching group). – Glenn Maynard Feb 15 '11 at 02:47
2

I fabricated something like this. Very redundant I suppose, but it does the job for me. You have to adapt it a bit to your specifications:

def csv_splitter(line):
    splitthese = [0]
    splitted = []
    splitpos = True
    for nr, i in enumerate(line):
        if i == "\"" and splitpos == True: 
            splitpos = False
        elif i == "\"" and splitpos == False:
            splitpos = True
        if i == "," and splitpos == True:
            splitthese.append(nr)
    splitthese.append(len(line)+1) 
    for i in range(len(splitthese)-1):
        splitted.append(re.sub("^,|\"","",line[splitthese[i]:splitthese[i+1]]))
    return splitted
Iwe M
  • 31
  • 3
2

The csv module won't handle the scenario of " and ' being quotes at the same time. Absent a module that provides that kind of dialect, one has to get into the parsing business. To avoid reliance on a third party module, we can use the re module to do the lexical analysis, using the re.MatchObject.lastindex gimmick to associate a token type with the matched pattern.

The following code when run as a script passes all the tests shown, with Python 2.7 and 2.2.

import re

# lexical token symbols
DQUOTED, SQUOTED, UNQUOTED, COMMA, NEWLINE = xrange(5)

_pattern_tuples = (
    (r'"[^"]*"', DQUOTED),
    (r"'[^']*'", SQUOTED),
    (r",", COMMA),
    (r"$", NEWLINE), # matches end of string OR \n just before end of string
    (r"[^,\n]+", UNQUOTED), # order in the above list is important
    )
_matcher = re.compile(
    '(' + ')|('.join([i[0] for i in _pattern_tuples]) + ')',
    ).match
_toktype = [None] + [i[1] for i in _pattern_tuples]
# need dummy at start because re.MatchObject.lastindex counts from 1 

def csv_split(text):
    """Split a csv string into a list of fields.
    Fields may be quoted with " or ' or be unquoted.
    An unquoted string can contain both a " and a ', provided neither is at
    the start of the string.
    A trailing \n will be ignored if present.
    """
    fields = []
    pos = 0
    want_field = True
    while 1:
        m = _matcher(text, pos)
        if not m:
            raise ValueError("Problem at offset %d in %r" % (pos, text))
        ttype = _toktype[m.lastindex]
        if want_field:
            if ttype in (DQUOTED, SQUOTED):
                fields.append(m.group(0)[1:-1])
                want_field = False
            elif ttype == UNQUOTED:
                fields.append(m.group(0))
                want_field = False
            elif ttype == COMMA:
                fields.append("")
            else:
                assert ttype == NEWLINE
                fields.append("")
                break
        else:
            if ttype == COMMA:
                want_field = True
            elif ttype == NEWLINE:
                break
            else:
                print "*** Error dump ***", ttype, repr(m.group(0)), fields
                raise ValueError("Missing comma at offset %d in %r" % (pos, text))
        pos = m.end(0)
    return fields

if __name__ == "__main__":
    tests = (
        ("""hey,hello,,"hello,world",'hey,world'\n""", ['hey', 'hello', '', 'hello,world', 'hey,world']),
        ("""\n""", ['']),
        ("""""", ['']),
        ("""a,b\n""", ['a', 'b']),
        ("""a,b""", ['a', 'b']),
        (""",,,\n""", ['', '', '', '']),
        ("""a,contains both " and ',c""", ['a', 'contains both " and \'', 'c']),
        ("""a,'"starts with "...',c""", ['a', '"starts with "...', 'c']),
        )
    for text, expected in tests:
        result = csv_split(text)
        print
        print repr(text)
        print repr(result)
        print repr(expected)
        print result == expected
John Machin
  • 81,303
  • 11
  • 141
  • 189
  • This works great John. I appreciate it very much. I really really wish there was a better DSL for regex – Vish Feb 14 '11 at 16:22