9

I'm struggling to split text rows, based on variable delimiter, and preserve empty fields and quoted data.

Examples:

1,"2",three,'four, 4',,"6\tsix"

or as tab-delimited vesion

1\t"2"\tthree\t'four, 4'\t\t"6\tsix"

Should both result in:

['1', '"2"', 'three', 'four, 4', '', "6\tsix"]

So far, i've tried:

  1. Using split, but clearly the quoted delimiters are not handled as desired.

  2. solutions using the csv library, but it tends to have options that quotes everything or nothing, without preserving the original quotes.

  3. Regex, particularly following the pattern from the following answer, but it drops the empty fields: How to split but ignore separators in quoted strings, in python?

  4. Using the pyparsing library. The best i've managed is as follows, but this also drops the empty fields (using the comma delimiter example):

    s = '1,"2",three,\'four, 4\',,"6\tsix"'
    wordchars = (printables + ' \t\r\n').replace(',', '', 1)
    delimitedList(OneOrMore(quotedString | Word(wordchars)), ',').parseWithTabs().parseString(s)
    

Thanks for any ideas!

Community
  • 1
  • 1
user2123203
  • 145
  • 7
  • For split to work you need to use a character or character combination that is guarenteed not to appear anywhere else in the string. Maybe you could use an escape character not on the keyboard? – PandaConda Jun 17 '14 at 15:42
  • I'm afraid I don't have control over the input file, so can't decide on the delimiter. Thanks for the idea though! – user2123203 Jun 17 '14 at 15:44
  • In that case regex would probably be the way to go. Not sure about the exact syntax though. – PandaConda Jun 17 '14 at 15:46
  • Split by comma and then trim – asimes Jun 17 '14 at 15:46
  • Do you always have the same number of fields (e.g. 6) in each text file? – Dologan Jun 17 '14 at 15:47
  • @asimes that appears to be close but it splits the '"four' and ' 4"' which is not what the OP is looking for. – brechmos Jun 17 '14 at 15:47
  • Wait, should the 'four, 4' be split into two separate members of the array or should they be the same member? – PandaConda Jun 17 '14 at 15:48
  • Why do you say that regex drops empty field ? The answer from Alan More in referenced post suggested `re.split(''';(?=(?:[^'"]|'[^']*'|"[^"]*")*$)''', data)`. I tried it (after changing `;` with `,`) and found `['1', '"2"', 'three', "'four, 4'", '', '"6\tsix"']` what is you said to expect. – Serge Ballesta Jun 17 '14 at 15:57

3 Answers3

7

This works for me:

import pyparsing as pyp

pyp.delimitedList(pyp.quotedString | pyp.SkipTo(',' | pyp.LineEnd()), ',') \
    .parseWithTabs().parseString(s)

Gives

['1', '"2"', 'three', "'four, 4'", '', '"6\tsix"']

Avoid creating Words with whitespace characters, or all printable characters. Pyparsing does not do any lookahead, and these expressions are likely to include much more than you had planned.

PaulMcG
  • 62,419
  • 16
  • 94
  • 130
  • What lib? I try to use the full path so others can better follow. – Russia Must Remove Putin Jun 17 '14 at 15:49
  • Sorry, I was piggybacking on the OP's reference to pyparsing. – PaulMcG Jun 17 '14 at 15:49
  • This works! Thank you so much! THe following works for the tab delimited version (same but with a White class). Maybe that was obvious, but i'm new to pyparsing, if you couldn't tell. `delimitedList(quotedString | SkipTo(White('\t') | LineEnd()), White('\t')).parseWithTabs().parseString(s)`. Thanks again! – user2123203 Jun 17 '14 at 15:59
  • If you want the quotation marks stripped from the parsed quoted strings, pyparsing will do that for you at parse time - before your parser definition, add `pyp.quotedString.setParseAction(pyp.removeQuotes)` – PaulMcG Jun 18 '14 at 12:31
3

use this pattern to match the commas outside double quotes
,(?=(?:(?:[^"]*\"){2})*[^"]*$)
Demo

Edit: to split commas outside double quotes or quotes use this pattern
,(?=(?:(?:[^'\"]*(?:\"|')){2})*[^'\"]*$)
Demo

alpha bravo
  • 7,838
  • 1
  • 19
  • 23
  • I think this version might work, but under match information it says: "No match groups were extracted. This means that your pattern matches but there were no (capturing (groups)) in it that matched anything in the subject string." So I'm not sure. – PandaConda Jun 17 '14 at 15:53
  • it was meant to use the matched commas as your split pattern – alpha bravo Jun 17 '14 at 15:54
  • `re.split(r',(?=(?:(?:[^"]*\"){2})*[^"]*$)', txt1)` returns `['1', '"2"', 'three', "'four", " 4'", '', '"6\tsix"']`, @alphabravo , you'll get more votes if you demonstrate it working. – Russia Must Remove Putin Jun 17 '14 at 15:54
  • So this version splits the 'four, 4' in 2, whereas the other version keeps them together. Not sure which version is correct in this case. – PandaConda Jun 17 '14 at 15:58
  • I am not aware of a CSV file that uses double and single quotes, but updated my pattern to solve this issue. – alpha bravo Jun 17 '14 at 16:13
2

Why do you say that regex drops empty field ? The answer from Alan More in referenced post suggested

re.split(''';(?=(?:[^'"]|'[^']*'|"[^"]*")*$)''', data)

I tried it (after changing ; with ,) and found ['1', '"2"', 'three', "'four, 4'", '', '"6\tsix"'] what is you said to expect

Community
  • 1
  • 1
Serge Ballesta
  • 143,923
  • 11
  • 122
  • 252