-1

I am parsing comma-delimited files with Python where some of the text fields are enclosed in double quotes because the text contains non-delimiting commas. For example, given this line of input:

field_1,field_2,...,"this,field,contains,non-delimiting,commas",...,field_n

I need to treat "this,field,contains,non-delimiting,commas" as a single quote-delimited field containing pesky commas.

My code handles this by comparing the indices of all commas and quotes in each line of input and slicing the line at the indices of all commas outside of paired quotes.

This strikes me as un-Pythonic, though, and I am hoping to get some suggestions for a more Pythonic solution.

Schemer
  • Can fields like field_1 be like integer? e.g. 123, "helloWorld", 99, "ha,ha,ha" – Samuel Toh Jun 09 '16 at 00:22
  • @SamuelToh: Yes. Fields can contain any character and represent any data type. The only consideration at this stage though is just to tokenize the fields as text while handling the inner delimiters. – Schemer Jun 09 '16 at 00:27
  • 5
    use the csv reader: https://docs.python.org/2/library/csv.html – Casimir et Hippolyte Jun 09 '16 at 00:32
  • 1
    If you have to deal with non-English text in your file, the csv package provided in the Python standard library is awful. Take a look at unicodecsv - https://pypi.python.org/pypi/unicodecsv/0.14.1 – Lynch Jun 09 '16 at 00:59

2 Answers

3

This is something that is directly handled by the csv module using csv.QUOTE_MINIMAL for quoting (comes as part of the excel dialect, possibly others).

Use csv.reader with appropriate flags, and do not roll your own parser please.
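
As a minimal sketch of what that looks like (assuming Python 3; the sample line and the helper name `parse_lines` are made up for illustration), the default `excel` dialect already treats a double-quoted field as one field and strips the surrounding quotes:

```python
import csv
import io

# Hypothetical sample input modeled on the line in the question.
sample = 'field_1,field_2,"this,field,contains,non-delimiting,commas",field_n\n'


def parse_lines(text):
    """Return a list of rows, each row being a list of string fields."""
    # csv.reader accepts any iterable of lines; StringIO stands in for a file.
    return list(csv.reader(io.StringIO(text)))


rows = parse_lines(sample)
print(rows[0])
# ['field_1', 'field_2', 'this,field,contains,non-delimiting,commas', 'field_n']
```

When reading from a real file, the csv documentation recommends opening it with `newline=''` and passing the file object straight to `csv.reader`.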

ShadowRanger
0

You can use a fairly simple generator to accomplish this.

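def tokenize(line):
    """Yield the comma-separated fields of line, keeping quoted commas."""
    field = ""
    in_quotes = False
    for char in line:
        if char == "," and not in_quotes:
            # An unquoted comma ends the current field.
            yield field
            field = ""
            continue
        elif char == "'" or char == '"':
            # Toggle quote mode; the quote character itself stays in the field.
            in_quotes = not in_quotes
        field += char
    # Emit the final field, which has no trailing comma to trigger a yield.
    yield field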

We loop through the string one character at a time, yielding the current field when we reach a comma that is not inside quotes, and toggling whether or not we are inside quotes each time we reach a quote character.

Community
Natecat
  • is there a reason you are making the data `reversed` and then doing `while input: char = input.pop()` instead of just doing `for char in input:`? – Tadhg McDonald-Jensen Jun 09 '16 at 00:43
  • @TadhgMcDonald-Jensen There was initially, but after some thinking I got rid of the part that needed it but didn't change it back. – Natecat Jun 09 '16 at 00:45
  • 3
    Why are we reinventing [the `csv` module](https://docs.python.org/2/library/csv.html)? – ShadowRanger Jun 09 '16 at 00:46
  • Consider this file: a,"b,c,\",e",a. In my example there is a quote inside the string; your code does not support escaped quotes. You should reuse tested code when possible, and the csv module addresses this issue. – Lynch Jun 09 '16 at 00:56
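
To illustrate the escaped-quote point, here is a rough sketch (assuming Python 3; the helper name `parse_escaped` is hypothetical) of how `csv.reader` can be told that a backslash escapes the quote character, via its `escapechar` and `doublequote` format parameters:

```python
import csv


def parse_escaped(line):
    """Parse one CSV line in which a backslash escapes the quote character."""
    # escapechar and doublequote are standard csv format parameters;
    # doublequote=False turns off the default "" convention for literal quotes.
    reader = csv.reader([line], escapechar="\\", doublequote=False)
    return next(reader)


# The Python literal below is the text: a,"b,c,\",e",a
fields = parse_escaped('a,"b,c,\\",e",a')
print(fields)
# ['a', 'b,c,",e', 'a']
```

With the default settings, the module instead expects a literal quote inside a quoted field to be written as two quotes in a row.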