pyspark split on delimiter ignoring double quotes using regex

Question

I am very new to Spark and am experimenting with RDDs. Here is my basic RDD:

rdd=sc.parallelize(['"ab,cd",9', 'xyz,6'])

Now if I want to split it on commas I do

rdd.map(lambda x:x.split(",")).collect()

which gives me

[['"ab', 'cd"', '9'], ['xyz', '6']]

Since I want to ignore commas inside double-quoted text, I write

rdd.map(lambda x:x.split(",(?=([^\\\"]*\\\"[^\\\"]*\\\")*[^\\\"]*$)")).collect()

which gives the output

[['"ab,cd",9'], ['xyz,6']] (Thus this is not a duplicate question)

But I want the output similar to what I get with .split(",") like so

[['ab,cd','9'], ['xyz','6']]

I am not very good with regex, so I do not know how to adjust it to get that output. Any help will be greatly appreciated.

Possible duplicate of [How to split but ignore separators in quoted strings, in python?](https://stackoverflow.com/questions/2785755/how-to-split-but-ignore-separators-in-quoted-strings-in-python) — pault, Mar 27 '18 at 16:07
@pault, no it is not a duplicate and I have mentioned in my question that the answer in the link you gave doesn't work for my case — learning_dev, Mar 27 '18 at 19:04

pault · Accepted Answer · 2018-03-27T19:25:19.463


You can adapt an answer to the linked question, changing the pattern to split on , instead of ;:

import re
pattern = r"""((?:[^,"']|"[^"]*"|'[^']*')+)"""
rdd.map(lambda x: re.split(pattern, x)[1::2]).collect()
#[['"ab,cd"', '9'], ['xyz', '6']]

The [1::2] slice takes every other item in the list, starting at index 1 (see Python's slice notation for details).

This pattern matches the fields (not the delimiter), and because it is wrapped in a capturing group, re.split keeps the matched fields in its result. Without the slice, you'd get:

[['', '"ab,cd"', ',', '9', ''], ['', 'xyz', ',', '6', '']]
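The same behaviour can be checked in plain Python, without a Spark context. This is a minimal sketch that reuses the pattern above; the variable names are only illustrative:

```python
import re

# Same pattern as above: one capturing group that matches a whole field.
pattern = r"""((?:[^,"']|"[^"]*"|'[^']*')+)"""

line = '"ab,cd",9'

# Because the pattern is a capturing group, re.split keeps the matched
# fields in the result, interleaved with the text between the matches.
parts = re.split(pattern, line)
print(parts)    # ['', '"ab,cd"', ',', '9', '']

# Odd indices hold the fields; even indices hold the leftovers
# (empty strings and the commas themselves).
fields = parts[1::2]
print(fields)   # ['"ab,cd"', '9']
```

One caveat with this approach: an empty field, as in 'a,,b', is not matched by the pattern, so it is silently dropped and the result is ['a', 'b'].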

Update

If you only want to ignore separators inside double quotes (and not single quotes), you can modify the pattern as follows:

pattern = r"""((?:[^,"]|"[^"]*")+)"""
rdd=sc.parallelize(["xy'z,6",'"ab,cd",5'])
rdd.map(lambda x: re.split(pattern, x)[1::2]).collect()
#[["xy'z", '6'], ['"ab,cd"', '5']]
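The question asked for fields without the surrounding quotes (['ab,cd', '9']), while the pattern above keeps them. One possible way to get there is to strip a matching pair of outer double quotes from each field after splitting. The sketch below is not part of the original answer: the helper name split_fields is made up for this example, and it assumes the data has no escaped quotes and no empty fields.

```python
import re

# Double-quote-only pattern from the update above, compiled once.
pattern = re.compile(r"""((?:[^,"]|"[^"]*")+)""")

def split_fields(line):
    """Split on commas outside double quotes, then drop the outer quotes.

    Hypothetical helper for illustration only; assumes no escaped quotes
    and no empty fields in the input.
    """
    fields = pattern.split(line)[1::2]
    # Remove the quotes only when a field both starts and ends with one.
    return [f[1:-1] if len(f) >= 2 and f[0] == f[-1] == '"' else f
            for f in fields]

print(split_fields('"ab,cd",9'))   # ['ab,cd', '9']
print(split_fields("xy'z,6"))      # ["xy'z", '6']
```

With the rdd defined above, this would be applied as rdd.map(split_fields).collect().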

edited Mar 27 '18 at 19:25

answered Mar 27 '18 at 16:12


Wonderful worked perfectly for me. Although, I will take some time to understand this thoroughly. Thanks a lot – learning_dev Mar 27 '18 at 18:35
What if in my rdd I also have `'` like `rdd=sc.parallelize(["xy'z,6",'"ab,cd",5'])` – learning_dev Mar 27 '18 at 18:59
@user7623678 in that case you'd have to modify the pattern by removing the logic for single quotes. This should work: `pattern = r"""((?:[^,"]|"[^"]*")+)"""` – pault Mar 27 '18 at 19:22
