rdd1 = sc.textFile('/user/training/checkouts') \
    .map(lambda line: line.split(',')) \
    .map(lambda fields: ((fields[1], fields[3], fields[5]), 1))

I used the command above to key each record by fields[1], fields[3], and fields[5].

The values below are what I actually got, because the second column of the input file contains several commas, and a comma is exactly what I split each line on. How can I split the data when the delimiter also appears inside fields? Or is there some way to drop the columns I do not need? I would like to remove the few columns with long strings that cause this problem.

[((u'BibNum', u'ItemCollection', u'ItemLocation'), 1),
 ((u'3011076', u' 1481425749', u' 9781481425742"'), 1),
 ((u'2248846', u' c1999."', u'"'), 1)]

I am expecting the following output.

   [((u'BibNum', u'ItemCollection', u'ItemLocation'), 1),
    ((u'3011076', u'qna', u'ncrdr'), 1),
    ((u'2248846', u'qkb', u'ncstr'), 1)]

Here is a sample input record (one record, shown with one field per line) to illustrate the problem:

3011076,
"A tale of two friends / adapted by Ellie O'Ryan ; illustrated by Tom Caulfield, Frederick Gardner, Megan Petasky, and Allen Tam.",
"O'Ryan, Ellie",
"1481425730, 1481425749, 9781481425735, 9781481425742",
2014.,
"Simon Spotlight,",
"Musicians Fiction, Bullfighters Fiction, Best friends Fiction, Friendship Fiction, Adventure and adventurers Fiction",
jcbk,
ncrdr,
Floating,
qna,
09/01/2017,
1

As you can see, the second field of this sample record contains many commas, which is what keeps a plain split on commas from working.
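The failure can be reproduced outside Spark with plain Python, by joining the sample record above back into the single CSV line it is in the file: a naive split(',') also breaks apart the quoted fields, while a CSV-aware parse keeps the 13 fields intact.

```python
import csv

# The sample record above, as a single CSV line.
record = (
    '3011076,'
    '"A tale of two friends / adapted by Ellie O\'Ryan ; illustrated by '
    'Tom Caulfield, Frederick Gardner, Megan Petasky, and Allen Tam.",'
    '"O\'Ryan, Ellie",'
    '"1481425730, 1481425749, 9781481425735, 9781481425742",'
    '2014.,'
    '"Simon Spotlight,",'
    '"Musicians Fiction, Bullfighters Fiction, Best friends Fiction, '
    'Friendship Fiction, Adventure and adventurers Fiction",'
    'jcbk,ncrdr,Floating,qna,09/01/2017,1'
)

naive = record.split(',')            # also splits inside the quoted fields
parsed = next(csv.reader([record]))  # respects the double quotes

print(len(naive))   # far more than 13
print(len(parsed))  # 13
```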

Gideok Seong

1 Answer


If I'm reading this correctly, and the sample data is not actually split across multiple lines but looks something like 3011076,"A tale of two friends / adapted..., then you should be able to use a CSV parser to load your data. CSV stands for comma-separated values and typically looks something like:

name,value
foo,10
bar,20

but of course a name might have a comma in it, so the format provides a way to handle that by enclosing such fields in double quotes:

name,value
foo,10
bar,20
"baz,qux",40

That is annoying if you want to split on commas yourself, but you're in luck: nearly every CSV parser will handle it for you.
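A quick illustration with Python's built-in csv module (any CSV parser behaves the same way):

```python
import csv

line = '"baz,qux",40'
fields = next(csv.reader([line]))  # the quoted comma stays inside its field
print(fields)  # ['baz,qux', '40']
```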

Check out spark-csv for a DataFrame approach, or the Python csv library.

With RDDs and Python CSV:

import csv

# Parse each line with csv.reader so that commas inside quoted
# fields stay inside their field instead of splitting it.
rdd1 = sc.textFile('/user/training/checkouts') \
    .map(lambda line: next(csv.reader([line]))) \
    .map(lambda fields: ((fields[1], fields[3], fields[5]), 1))
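The per-line parsing step can be sanity-checked without a Spark cluster, since the lambda only depends on the csv module (the line below is a shortened, hypothetical record just for illustration):

```python
import csv

# Same parsing step as the .map() above, isolated for testing.
parse = lambda line: next(csv.reader([line]))

fields = parse('3011076,"O\'Ryan, Ellie",x,"1481425730, 1481425749",y,qna')
print((fields[1], fields[3], fields[5]))
# ("O'Ryan, Ellie", '1481425730, 1481425749', 'qna')
```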

However, I highly recommend checking out the spark-csv library, because you'll likely get much better performance with it.

df = sqlContext.read \
    .format('com.databricks.spark.csv') \
    .load('/user/training/checkouts')
df.select(...)
Benjamin Manns
  • df = sqlContext.read \ .. doesn't seem to work with Spark 2.3.1, even when I add the quote and escape settings. – Sade Oct 30 '18 at 13:18