
I am trying to understand how re.split() works with a non-capturing group when splitting a comma-delimited string.

This is my code:

    import re

    pattern = re.compile(r',(?=(?:"[^"]*")*[^"]*$)')
    text = 'qarcac,"this is, test1",123566'
    results = re.split(pattern, text)
    for r in results:
        print(r.strip())

When I execute this code, the results are as expected.

split1: qarcac

split2: "this is, test1"

split3: 123566

However, if I add one more double-quoted string to the source text, it doesn't work as expected:

text = 'qarcac,"this is, test1","this is, test2", 123566, testdata'

This produces the following output:

split1: qarcac,"this is, test1"

split2: "this is, test2"

split3: 123566

Can someone explain what's going on here and why the non-capturing group behaves differently in these two cases?

AngiSen
  • You should use the `csv` module to parse a CSV string. The regex you are using is very inefficient, and if the string is very long, performance might drop significantly. – Wiktor Stribiżew Aug 05 '18 at 11:38
  • Thanks Wiktor. I am not going to productionize it; I am just trying to learn, as I came across this code in one of my learning modules. – AngiSen Aug 05 '18 at 11:42
  • The pattern that works is `,(?=(?:"[^"]*"|[^"])*$)`. Or `,(?=[^"]*(?:"[^"]*"[^"]*)*$)`. See [Regex to pick commas outside of quotes](https://stackoverflow.com/questions/632475/regex-to-pick-commas-outside-of-quotes). – Wiktor Stribiżew Aug 05 '18 at 11:45
  • See https://regex101.com/r/dRqJZT/1; there is a good explanation of any regex you type into the pattern field on the right. – Wiktor Stribiżew Aug 05 '18 at 11:58
  • Thanks Wiktor. How does re.split() mark the first occurrence of a comma in the source string with the following regex, where [^"] is used: ,(?=(?:"[^"]*"|[^"])*$) – AngiSen Aug 05 '18 at 12:10

1 Answer


This has nothing to do with (non-)capturing groups.

(?:"[^"]*")*[^"]*$ matches:

  • "[^"]*" - a quoted string (two quotes with 0 or more non-quotes in between)
  • (?: ... )* - 0 or more of those quoted strings
  • [^"]* - followed by 0 or more non-quotes
  • $ - followed by the end of the string

In other words, this regex matches something like "foo""bar""baz"otherstuff.
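
One way to check this reading is to run the lookahead's body on its own with re.fullmatch. This is only a small sketch; the variable names and the second sample string are made up for illustration.

```python
import re

# The body of the lookahead from the question, without the leading comma.
inner = re.compile(r'(?:"[^"]*")*[^"]*$')

# Quoted strings right next to each other, then an unquoted tail: matches.
adjacent = inner.fullmatch('"foo""bar""baz"otherstuff') is not None

# Quoted strings separated by a comma: the unquoted tail [^"]* cannot
# cross the second opening quote, so the end of the string is never reached.
separated = inner.fullmatch('"foo","bar"') is not None

print(adjacent, separated)  # True False
```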

In your first example, the target string is:

qarcac,"this is, test1",123566
       ^^^^^^^^^^^^^^^^^^^^^^^

I've underlined the part that is matched by the above regex (a quoted part followed by an unquoted tail followed by the end of the string).

In your second example, the target string is:

qarcac,"this is, test1","this is, test2", 123566, testdata
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Again, I've underlined the part that is matched by the regex.

The first quoted part is not matched because of the comma:

"this is, test1","this is, test2"
                X

"foo","bar" is not matched because your regex requires the quoted parts to be right next to each other, as in "foo""bar", with nothing in between.
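
To see this on the two strings from the question, you can list the index of every comma at which the original pattern matches. A minimal sketch; the helper name split_points is hypothetical.

```python
import re

# The original pattern from the question.
pattern = re.compile(r',(?=(?:"[^"]*")*[^"]*$)')

def split_points(text):
    """Return the indexes of the commas at which the pattern matches."""
    return [m.start() for m in pattern.finditer(text)]

text1 = 'qarcac,"this is, test1",123566'
text2 = 'qarcac,"this is, test1","this is, test2", 123566, testdata'

print(split_points(text1))  # [6, 23]
print(split_points(text2))  # [23, 40, 48]
```

In the second string the comma at index 6 is rejected: the lookahead consumes the first quoted part, then gets stuck on the comma at index 23, which is followed by another quote.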


If you just want to make sure that every matched comma is outside of a quoted part (i.e. is followed by an even number of quotes), you can simply use

,(?=[^"]*(?:"[^"]*"[^"]*)*$)

as your regex.
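
As a rough sketch, here is that pattern applied to the second string from the question, next to the standard csv module suggested in the comments. The two are not equivalent: the regex keeps the surrounding quotes, while csv.reader removes them. Using skipinitialspace=True is my assumption about how the spaces after the commas should be treated.

```python
import csv
import re

text = 'qarcac,"this is, test1","this is, test2", 123566, testdata'

# Split only on commas that are followed by an even number of quotes.
fixed = re.compile(r',(?=[^"]*(?:"[^"]*"[^"]*)*$)')
regex_fields = [field.strip() for field in fixed.split(text)]

# csv.reader takes an iterable of lines; skipinitialspace drops the
# blank that directly follows a delimiter.
csv_fields = next(csv.reader([text], skipinitialspace=True))

print(regex_fields)
# ['qarcac', '"this is, test1"', '"this is, test2"', '123566', 'testdata']
print(csv_fields)
# ['qarcac', 'this is, test1', 'this is, test2', '123566', 'testdata']
```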

melpomene