
I am very new to Spark and am experimenting with RDDs. Here is my basic RDD:

rdd=sc.parallelize(['"ab,cd",9', 'xyz,6'])

Now, if I want to split it on commas, I do:

rdd.map(lambda x:x.split(",")).collect()

which gives me

[['ab', 'cd', '9'], ['xyz', '6']]

Since I want to ignore commas that appear inside double-quoted text, I write:

rdd.map(lambda x:x.split(",(?=([^\\\"]*\\\"[^\\\"]*\\\")*[^\\\"]*$)")).collect()   

which gives the output

[['ab,cd,9'], ['xyz,6']] (Thus this is not a duplicate question)

But I want output similar to what I get with .split(","), like so:

[['ab,cd','9'], ['xyz','6']]

I am not very good with regex, so I do not know how to adjust the pattern to get that output. Any help will be greatly appreciated.

learning_dev
  • Please format your question so it is readable. – user3483203 Mar 27 '18 at 15:32
  • Possible duplicate of [How to split but ignore separators in quoted strings, in python?](https://stackoverflow.com/questions/2785755/how-to-split-but-ignore-separators-in-quoted-strings-in-python) – pault Mar 27 '18 at 16:07
  • @pault, no it is not a duplicate and I have mentioned in my question that the answer in the link you gave doesn't work for my case – learning_dev Mar 27 '18 at 19:04

1 Answer


You can adapt the pattern from this answer so that it splits on `,` instead of `;`:

import re
pattern = r"""((?:[^,"']|"[^"]*"|'[^']*')+)"""
rdd.map(lambda x: re.split(pattern , x)[1::2]).collect()
#[['"ab,cd"', '9'], ['xyz', '6']]

The [1::2] slice takes every other item in the list, starting at index 1. See Python's slice notation for more detail.

This pattern is a capturing group that matches the fields (not the delimiter), so re.split returns each field along with the text between fields. Without the slice, you'd get:

[['', '"ab,cd"', ',', '9', ''], ['', 'xyz', ',', '6', '']]
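To see where those extra items come from, here is a minimal sketch in plain Python (no Spark needed). The helper names `split_fields` and `strip_quotes` are made up for illustration. It also shows why the attempt in the question returned each line unsplit: `str.split` treats its argument as a literal separator, not as a regex. The last step strips the surrounding double quotes, which I am assuming is what the desired output in the question implies.

```python
import re

# Same pattern as above: one capturing group that matches a whole field.
pattern = r"""((?:[^,"']|"[^"]*"|'[^']*')+)"""

def split_fields(line):
    # re.split puts the captured fields at odd indices and the text
    # between them (the commas, or '') at even indices.
    return re.split(pattern, line)[1::2]

def strip_quotes(fields):
    # Hypothetical extra step: drop the surrounding double quotes.
    return [f.strip('"') for f in fields]

line = '"ab,cd",9'

# str.split looks for the literal text, so a regex string never matches.
literal = line.split(r",(?=x)")          # ['"ab,cd",9']

raw = re.split(pattern, line)            # ['', '"ab,cd"', ',', '9', '']
fields = split_fields(line)              # ['"ab,cd"', '9']
clean = strip_quotes(fields)             # ['ab,cd', '9']
print(literal, raw, fields, clean)
```

With an RDD, the same functions should drop into the map call, e.g. `rdd.map(lambda x: strip_quotes(split_fields(x)))`.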

Update

If you only want to ignore the separator between double quotes (and not single quotes), you can modify the pattern as follows:

pattern = r"""((?:[^,"]|"[^"]*")+)"""
rdd=sc.parallelize(["xy'z,6",'"ab,cd",5'])
rdd.map(lambda x: re.split(pattern , x)[1::2]).collect()
#[["xy'z", '6'], ['"ab,cd"', '5']]
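As an alternative to the regex (a different technique, not the pattern above), Python's standard `csv` module may also fit if the lines follow ordinary CSV quoting. This is only a sketch: `parse_line` is a made-up helper name, and it assumes each RDD element is a single CSV record with no embedded newlines. With the default dialect, only double quotes are treated as quote characters, and they are removed from the parsed fields.

```python
import csv

def parse_line(line):
    # csv.reader expects an iterable of lines, so wrap the single line
    # in a list and take the one row it produces.
    return next(csv.reader([line]))

lines = ["xy'z,6", '"ab,cd",5']
parsed = [parse_line(x) for x in lines]
print(parsed)
# With an RDD this would be: rdd.map(parse_line).collect()
```

Unlike the regex version, this returns the fields without the surrounding double quotes.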
pault
  • Wonderful, this worked perfectly for me, although I will need some time to understand it thoroughly. Thanks a lot – learning_dev Mar 27 '18 at 18:35
  • What if my RDD also contains `'`, like `rdd=sc.parallelize(["xy'z,6",'"ab,cd",5'])`? – learning_dev Mar 27 '18 at 18:59
  • @user7623678 in that case you'd have to modify the pattern by removing the logic for single quotes. This should work: `pattern = r"""((?:[^,"]|"[^"]*")+)"""` – pault Mar 27 '18 at 19:22