rdd1 = sc.textFile('/user/training/checkouts').map(lambda line: line.split(',')).map(lambda fields: ((fields[1], fields[3], fields[5]), 1))
I used the command above to build a key from fields[1], fields[3] and fields[5] only.
The values below are the actual output I got. The second column in the input file contains several commas, the same character I use to split the line, so the fields end up misaligned. How can I split the data when the delimiter also appears inside a field? Or is there some way to drop the columns I do not want to use? I would like to delete the columns that hold long strings, since they are what cause this problem.
[((u'BibNum', u'ItemCollection', u'ItemLocation'), 1),
((u'3011076', u' 1481425749', u' 9781481425742"'), 1),
((u'2248846', u' c1999."', u'"'), 1)]
I am expecting the following output.
[((u'BibNum', u'ItemCollection', u'ItemLocation'), 1),
((u'3011076', u'qna', u'ncrdr'), 1),
((u'2248846', u'qkb', u'ncstr'), 1)]
Here is a sample input record to illustrate the problem, shown one field per line:
3011076,
"A tale of two friends / adapted by Ellie O'Ryan ; illustrated by Tom Caulfield, Frederick Gardner, Megan Petasky, and Allen Tam.",
"O'Ryan, Ellie",
"1481425730, 1481425749, 9781481425735, 9781481425742",
2014.,
"Simon Spotlight,",
"Musicians Fiction, Bullfighters Fiction, Best friends Fiction, Friendship Fiction, Adventure and adventurers Fiction",
jcbk,
ncrdr,
Floating,
qna,
09/01/2017,
1
As you can see, the second field of this sample record contains many commas inside its double quotes (and several later quoted fields do too), which keeps a plain split(',') from separating the fields correctly.
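Since the fields that contain commas are wrapped in double quotes, one option might be to parse each line with Python's standard csv module instead of split(','). The following is only a minimal sketch under some assumptions: parse_line and pick are hypothetical helper names, the sample record is assumed to sit on a single line, and positions 0, 8 and 10 are simply where 3011076, ncrdr and qna appear in the sample record above, not the indices from my original command.

```python
import csv

def parse_line(line):
    """Split one CSV line into fields, keeping commas inside double quotes."""
    # csv.reader accepts any iterable of strings, so wrap the line in a list.
    # skipinitialspace=True tolerates a space right after each delimiter.
    rows = list(csv.reader([line], skipinitialspace=True))
    return rows[0] if rows else []  # an empty line yields no row at all

def pick(fields, positions):
    """Return the fields at the given positions as a tuple (usable as a key)."""
    return tuple(fields[i] for i in positions)

# The sample record from above, assumed to be one line of the input file.
sample = (
    '3011076,'
    '"A tale of two friends / adapted by Ellie O\'Ryan ; illustrated by '
    'Tom Caulfield, Frederick Gardner, Megan Petasky, and Allen Tam.",'
    '"O\'Ryan, Ellie",'
    '"1481425730, 1481425749, 9781481425735, 9781481425742",'
    '2014.,'
    '"Simon Spotlight,",'
    '"Musicians Fiction, Bullfighters Fiction, Best friends Fiction, '
    'Friendship Fiction, Adventure and adventurers Fiction",'
    'jcbk,ncrdr,Floating,qna,09/01/2017,1'
)

fields = parse_line(sample)
key = pick(fields, (0, 8, 10))  # assumed positions, see the note above
print(len(fields))  # 13
print(key)          # ('3011076', 'ncrdr', 'qna')

# In Spark the same helpers could replace the split(',') step, e.g.:
#   sc.textFile('/user/training/checkouts') \
#     .map(parse_line) \
#     .map(lambda fields: (pick(fields, (0, 8, 10)), 1))
```

Two caveats with this sketch: parsing line by line would break if a quoted field ever contained a line break, and on Python 2 the csv module does not handle non-ASCII unicode input directly. On Spark 2.0 or later, reading the file as a DataFrame with spark.read.csv(path, header=True) and selecting only the needed columns may be a simpler route.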