I'm trying to find a regex that splits the string in a (below) into a list. I haven't yet found a foolproof way of splitting the string, but the main reason for asking is that I cannot understand why the last string is duplicated. It does not happen when I test the pattern online at regex101.com. As far as I understand, re.split should have no reason to duplicate data.

The code is:

import re
a = ['"This is a string", "and this is another with a , in it", Thisisalsovalid, "",,,"And a string"']
b = re.split(r',(?=(".*?"|[\w/-]*|,))', a[0])
for i in b:
    print(i)

and the output:

"This is a string"

 "and this is another with a

 in it"

 Thisisalsovalid

 ""




"And a string"
"And a string"

The expected output is:

"This is a string"
"and this is another with a , in it"
Thisisalsovalid
""


"And a string"

The resulting list is to be zipped with a list of headers, so the fields must line up without indexing problems.

As a bonus, I would gladly take a regex that splits on ',' except when it occurs inside a quoted string.

Bengt62
  • I don't know why the last match is duplicated, but I can contribute a [pattern](https://regex101.com/r/zR7uR1/1). It only matches commas that are followed by an even number of quotes. – Aran-Fey Dec 10 '14 at 13:46
  • One simple answer is ,(?! ) but that is error prone since there is no guarantee for the space in the real data. And still, the reason for the duplication is what puzzles me most. – Bengt62 Dec 10 '14 at 13:46
  • This is not a duplicate. The OP wants to know the reason for the repetition as well. Nominating for reopening. – vks Dec 10 '14 at 13:50
  • @vks I think you know about the duplicate question. Why did you fail to mark this as a duplicate? – Avinash Raj Dec 10 '14 at 13:52
  • @Bengt62 I think you should get the answer from this `\s*,\s*(?=(?:[^"]*"[^"]*")*[^"]*$)` regex. If not, then I'll reopen this question. – Avinash Raj Dec 10 '14 at 13:53
  • @AvinashRaj I want the question reopened because the main question has not been answered: why parts of the data get duplicated. That is what an answer must explain for me to accept it. However, both your regex and the csv solution below work for my data, so the secondary question is answered. – Bengt62 Dec 10 '14 at 14:20
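
The lookahead pattern suggested in the comments can be tried against the question's data as follows. This is only a minimal sketch: the variable names are chosen for illustration, and the approach assumes that quotes are balanced and never escaped inside a field.

```python
import re

# Sample string from the question.
line = '"This is a string", "and this is another with a , in it", Thisisalsovalid, "",,,"And a string"'

# Split on a comma (plus surrounding whitespace) only when an even number of
# double quotes remains up to the end of the string, i.e. when the comma lies
# outside any quoted field. The inner group is non-capturing, so re.split
# returns only the fields themselves.
pattern = re.compile(r'\s*,\s*(?=(?:[^"]*"[^"]*")*[^"]*$)')

fields = pattern.split(line)
for field in fields:
    print(field)
```

On this input the split yields seven fields, with the quotes kept and the comma inside the second string left alone.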

2 Answers

,(?=(?:[^"]*""?[^"]*")*[^"]*$)

Try this. See the demo:

https://regex101.com/r/nL5yL3/36

Your own pattern can also work if you change it to:

b = re.split(r',(?=(?:".*?"|[\w/-]*|,))', a[0])

                    ^^

Use this. The duplicates appear because your pattern contains a capturing group, and re.split also returns the text matched by capturing groups. So make the group non-capturing with (?:...).
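
The effect of the capturing group can be reproduced on a smaller input. The following is a minimal sketch: the short sample string and the variable names are made up for illustration, while the two longer patterns are the one from the question and the corrected one above.

```python
import re

# Tiny made-up sample, used only to make the effect easy to see.
text = 'a,b,c'

# Capturing group in the lookahead: re.split inserts the captured text
# after each piece, so every letter following a comma shows up twice.
with_group = re.split(r',(?=(\w))', text)
print(with_group)      # ['a', 'b', 'b', 'c', 'c']

# Non-capturing group (?:...): only the pieces between separators remain.
without_group = re.split(r',(?=(?:\w))', text)
print(without_group)   # ['a', 'b', 'c']

# The same comparison on the question's string.
line = '"This is a string", "and this is another with a , in it", Thisisalsovalid, "",,,"And a string"'
original = re.split(r',(?=(".*?"|[\w/-]*|,))', line)
fixed = re.split(r',(?=(?:".*?"|[\w/-]*|,))', line)
print(len(original), len(fixed))   # 15 8
```

The non-capturing version removes the extra entries, but on this data it still splits at the comma inside the second quoted string, because the lookahead can always succeed with an empty match.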

vks

Why not use an existing solution to read csv formatted strings?

import csv
import io
s = ['"This is a string", "and this is another with a , in it", Thisisalsovalid, "",,,"And a string"']
# skipinitialspace=True ignores the blank after each comma, so quoted fields are parsed as quoted
reader = csv.reader(io.StringIO(s[0]), skipinitialspace=True)
for row in reader:
    for value in row:
        print(value)

Output:

This is a string
and this is another with a , in it
Thisisalsovalid



And a string
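
Since the question mentions zipping the fields with a list of headers, here is a minimal sketch of that step on top of the csv approach. The header names are hypothetical and invented purely for illustration; the real ones are not given in the question.

```python
import csv
import io

line = '"This is a string", "and this is another with a , in it", Thisisalsovalid, "",,,"And a string"'

# Hypothetical header names, made up for this example only.
headers = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'h7']

# The string holds a single CSV record, so take the first (and only) row.
row = next(csv.reader(io.StringIO(line), skipinitialspace=True))

# Empty fields are kept as empty strings, so positions stay aligned with the headers.
record = dict(zip(headers, row))
for name, value in record.items():
    print(name, repr(value))
```

Note that csv strips the surrounding quotes, so an empty quoted field and a field with nothing between two commas both come back as an empty string.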
Reut Sharabani
  • The output is not what the OP expected, is it? – Kent Dec 10 '14 at 13:49
  • Although this solution works, it didn't answer the main question, so I'm accepting @vks's answer instead. This gave me new insights as well, thanks. – Bengt62 Dec 10 '14 at 14:28
  • No problem, that's what you should do. This is for the sake of people struggling with regular expressions and ending up here, when they should be using an existing solution. – Reut Sharabani Dec 10 '14 at 14:39