3

Am trying to read data from csv file, split each row into respective columns.

But my regex is failing when a particular column has commas with in itself.

eg: a,b,c,"d,e, g,",f

I want result like:

a    b    c    "d,e, g,"    f  

which is 5 columns.

Here is the regex am using to split the string by comma

,(?=(?:"[^"]?(?:[^"])*))|,(?=[^"]+(?:,)|,+|$)

but it fails for few strings while it works for others.

All am looking for is, when I read data from csv using pyspark into dataframe/rdd, I want to load/preserve all the columns without any mistakes

Thank You

Alekhya Vemavarapu
  • 1,145
  • 1
  • 10
  • 26
  • Why bother if there are multiple options available out of the box? http://stackoverflow.com/a/34528938/1560062 – zero323 Aug 09 '16 at 17:05

3 Answers3

4

Much easier with the help of the newer regex module:

import regex as re

string = 'a,b,c,"d,e, g,",f'
rx = re.compile(r'"[^"]*"(*SKIP)(*FAIL)|,')

parts = rx.split(string)
print(parts)
# ['a', 'b', 'c', '"d,e, g,"', 'f']

It supports the (*SKIP)(*FAIL) mechanism, which ignores everything betweem double quotes in this example.


If you have escaped double quotes, you could use:
import regex as re

string = '''a,b,c,"d,e, g,",f, this, one, with "escaped \"double",quotes:""'''
rx = re.compile(r'".*?(?<!\\)"(*SKIP)(*FAIL)|,')
parts = rx.split(string)
print(parts)
# ['a', 'b', 'c', '"d,e, g,"', 'f', ' this', ' one', ' with "escaped "double",quotes:""']

See a demo for the latter on regex101.com.


For nearly 50 points, I feel to provide the csv methods as well:
import csv
string = '''a,b,c,"d,e, g,",f, this, one, with "escaped \"double",quotes:""'''

# just make up an iterable, normally a file would go here
for row in csv.reader([string]):
    print(row)
    # ['a', 'b', 'c', 'd,e, g,', 'f', ' this', ' one', ' with "escaped "double"', 'quotes:""']
Jan
  • 42,290
  • 8
  • 54
  • 79
3

Try \,(?=([^"\\]*(\\.|"([^"\\]*\\.)*[^"\\]*"))*[^"]*$).

Used this answer which explains how to match everything that is not in quotes ignoring escaped quotes and http://regexr.com/ to test.

Note that - as other answers to your question state - there are better ways to parse CSV than use a regex.

user2609980
  • 10,264
  • 15
  • 74
  • 143
3

You can't easily parse CSV files with regex.

My go-to toolkit for handling CSV from the Unix command line is csvkit, which you can get from https://csvkit.readthedocs.io . It has a Python library as well.

The Python docs for the standard csv library are here: https://docs.python.org/2/library/csv.html

There is an extensive discussion of parsing CSV here:

https://softwareengineering.stackexchange.com/questions/166454/can-the-csv-format-be-defined-by-a-regex

This is a well-trodden path, and the libraries are good enough that you shouldn't roll your own code.

Community
  • 1
  • 1
vielmetti
  • 1,864
  • 16
  • 23