Python - Split a row into columns - csv data

Question

Am trying to read data from csv file, split each row into respective columns.

But my regex is failing when a particular column has commas with in itself.

eg: a,b,c,"d,e, g,",f

I want result like:

a    b    c    "d,e, g,"    f

which is 5 columns.

Here is the regex am using to split the string by comma

,(?=(?:"[^"]?(?:[^"])*))|,(?=[^"]+(?:,)|,+|$)

but it fails for few strings while it works for others.

All am looking for is, when I read data from csv using pyspark into dataframe/rdd, I want to load/preserve all the columns without any mistakes

Thank You

Why bother if there are multiple options available out of the box? http://stackoverflow.com/a/34528938/1560062 — zero323, Aug 09 '16 at 17:05

Jan · Accepted Answer · 2016-08-10T09:05:45.377

Much easier with the help of the newer regex module:

import regex as re

string = 'a,b,c,"d,e, g,",f'
rx = re.compile(r'"[^"]*"(*SKIP)(*FAIL)|,')

parts = rx.split(string)
print(parts)
# ['a', 'b', 'c', '"d,e, g,"', 'f']

It supports the (*SKIP)(*FAIL) mechanism, which ignores everything betweem double quotes in this example.

If you have escaped double quotes, you could use:

import regex as re

string = '''a,b,c,"d,e, g,",f, this, one, with "escaped \"double",quotes:""'''
rx = re.compile(r'".*?(?<!\\)"(*SKIP)(*FAIL)|,')
parts = rx.split(string)
print(parts)
# ['a', 'b', 'c', '"d,e, g,"', 'f', ' this', ' one', ' with "escaped "double",quotes:""']

See a demo for the latter on regex101.com.

For nearly 50 points, I feel to provide the csv methods as well:

import csv
string = '''a,b,c,"d,e, g,",f, this, one, with "escaped \"double",quotes:""'''

# just make up an iterable, normally a file would go here
for row in csv.reader([string]):
    print(row)
    # ['a', 'b', 'c', 'd,e, g,', 'f', ' this', ' one', ' with "escaped "double"', 'quotes:""']

score 3 · Answer 2 · answered Aug 09 '16 at 16:11

Try \,(?=([^"\\]*(\\.|"([^"\\]*\\.)*[^"\\]*"))*[^"]*$).

Used this answer which explains how to match everything that is not in quotes ignoring escaped quotes and http://regexr.com/ to test.

Note that - as other answers to your question state - there are better ways to parse CSV than use a regex.

score 3 · Answer 3 · edited Apr 12 '17 at 07:31

You can't easily parse CSV files with regex.

My go-to toolkit for handling CSV from the Unix command line is csvkit, which you can get from https://csvkit.readthedocs.io . It has a Python library as well.

The Python docs for the standard csv library are here: https://docs.python.org/2/library/csv.html

There is an extensive discussion of parsing CSV here:

https://softwareengineering.stackexchange.com/questions/166454/can-the-csv-format-be-defined-by-a-regex

This is a well-trodden path, and the libraries are good enough that you shouldn't roll your own code.

Python - Split a row into columns - csv data

3 Answers3