Spark Python: how to group RDD items using a special marker?

Question

I have a huge file (a.txt) like the following, in which a special marker line divides the data into groups:

a1
a2
$$$$$$$$
a1
c1
b1
c2
$$$$$$$$
d1
d2
$$$$$$$$
...

I want to use python code like:

line = sc.textFile("a.txt")
line1 = line.filter() or line.filter.map()...
...

to divide the items into several groups like the following: (a1, a2), (a1, c1, b1, c2), (d1, d2), ... but I could not figure out how to do it. Can somebody help?

@zero323, I have searched related topics and could not find a duplicate question. If you found one, please give me the link. Thanks. This one is NOT a duplicate!! – Olaf apple, Jul 12 '16 at 14:33
You want to combine records based on a specific delimiter, right? This should be done on read, as explained in the linked question. – zero323, Jul 12 '16 at 14:47
a1, a2, $$$$$$$$ ... are NOT on one line. The linked question is different. – Olaf apple, Jul 12 '16 at 14:56
@zero323, I found the link you mentioned. I will double-check it, thank you! – Olaf apple, Jul 12 '16 at 15:04

score 1 · Answer 1 · answered Jul 12 '16 at 03:48

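import itertools
import pandas as pd

# Read the copied sample lines from the clipboard into a one-column DataFrame
df = pd.read_clipboard(header=None)
mn = df[0].tolist()

def isplit(iterable, splitters):
    # Group consecutive items by "is a splitter?" and keep only the non-splitter runs
    return [list(g) for k, g in itertools.groupby(iterable, lambda x: x in splitters) if not k]

isplit(mn, ('$$$$$$$$',))
Out[84]: [['a1', 'a2'], ['a1', 'c1', 'b1', 'c2'], ['d1', 'd2']]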


MaThMaX


MaThMaX, thanks for the reply. `line` is an RDD; do you have an answer for that? The file a.txt is actually very large (>10G). I would like a direct answer that works on the RDD using filter or other RDD functions. – Olaf apple Jul 12 '16 at 03:59
@Olafapple, I think this is already another question... I have no experience with [`Spark`](http://spark.apache.org/docs/latest/programming-guide.html). But if you would like to use pandas, you can read [How to work with BigData using Pandas](http://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas). – MaThMaX Jul 12 '16 at 04:10
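
For the RDD case, one possible approach follows zero323's suggestion of splitting on read: tell Hadoop's `TextInputFormat` to use the marker as the record delimiter, so each record already holds one whole group. The sketch below is not from the original answers and was not run against the 10G file. The helper names `record_to_group` and `read_groups` are made up for illustration. It assumes `sc` is an existing `SparkContext` and that the marker string never appears inside a data line.

```python
DELIMITER = "$$$$$$$$"  # the marker line from the question


def record_to_group(record):
    """Turn one delimiter-separated record into a tuple of its non-empty lines."""
    return tuple(line.strip() for line in record.splitlines() if line.strip())


def read_groups(sc, path, delimiter=DELIMITER):
    """Return an RDD of tuples, one tuple per group (hypothetical helper).

    `sc` is assumed to be an already-created SparkContext.
    """
    # Ask Hadoop to split records on the marker instead of on newlines.
    rdd = sc.newAPIHadoopFile(
        path,
        "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text",
        conf={"textinputformat.record.delimiter": delimiter},
    )
    # Each element is (byte offset, record text); keep the text and drop
    # empty groups such as the one after the final marker.
    return rdd.map(lambda kv: record_to_group(kv[1])).filter(lambda g: len(g) > 0)
```

With the sample data, `read_groups(sc, "a.txt").collect()` would be expected to give the groups (a1, a2), (a1, c1, b1, c2), (d1, d2). Because the split happens while reading, no group has to be reassembled across partitions afterwards.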
