
I have a huge file (a.txt) like the following, in which a special marker line divides the data into groups:

a1
a2
$$$$$$$$
a1
c1
b1
c2
$$$$$$$$
d1
d2
$$$$$$$$
...

I want to use Python (PySpark) code like:

line = sc.textFile("a.txt")
line1 = line.filter() or line.filter.map()...
...

to split the items into groups like the following: (a1, a2), (a1, c1, b1, c2), (d1, d2), ... but I could not figure out how to do it. Can somebody help?

Olaf apple

1 Answer

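import itertools
import pandas as pd

# Read the sample lines copied to the clipboard (one value per row)
df = pd.read_clipboard(header=None)
mn = df[0].tolist()

def isplit(iterable, splitters):
    # Group consecutive items by "is it a splitter?" and keep only the non-splitter runs
    return [list(g) for k, g in itertools.groupby(iterable, lambda x: x in splitters) if not k]

isplit(mn, ('$$$$$$$$',))
Out[84]: [['a1', 'a2'], ['a1', 'c1', 'b1', 'c2'], ['d1', 'd2']]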
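
The same groupby idea can also be applied lazily to an open file, so a large a.txt is never loaded into memory at once. This is only a minimal sketch: `iter_groups` and `MARKER` are hypothetical names introduced here, and it assumes the marker always sits on a line of its own and that blank lines can be ignored.

```python
import itertools

MARKER = "$$$$$$$$"  # separator line, as shown in the question


def iter_groups(lines, marker=MARKER):
    """Yield one list per run of consecutive non-marker lines.

    `lines` can be any iterable of strings, e.g. an open file object,
    so only one group is held in memory at a time.
    """
    stripped = (ln.strip() for ln in lines)
    non_empty = (ln for ln in stripped if ln)  # assumption: blank lines carry no data
    for is_marker, run in itertools.groupby(non_empty, key=lambda ln: ln == marker):
        if not is_marker:
            yield list(run)
```

Typical use would be `with open("a.txt") as fh:` followed by `for group in iter_groups(fh): ...`, which processes the groups one at a time.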
MaThMaX
  • MaThMaX, thanks for the reply. `line` is an RDD. Do you have a good answer for that case? The file a.txt is actually very large (>10 GB), so I would like an answer that works directly on the RDD, using filter or another RDD function. – Olaf apple Jul 12 '16 at 03:59
  • @Olafapple, I think this is already a different question... I have no experience with [`Spark`](http://spark.apache.org/docs/latest/programming-guide.html). But if you would like to use pandas, you can read [How to work with BigData using Pandas](http://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas). – MaThMaX Jul 12 '16 at 04:10
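
For the RDD case raised in the comments, one possible approach (not from the answer above) is to have Hadoop's text input format split the file on the marker instead of on newlines, so each RDD element is already a whole group. The sketch below is hedged: `load_groups` and `record_to_group` are hypothetical names, `sc` is assumed to be an existing SparkContext, and it assumes the `textinputformat.record.delimiter` setting is honored by the Hadoop version in use and that the marker string never occurs inside a data line.

```python
DELIMITER = "$$$$$$$$"  # marker from the question, used as the record delimiter


def record_to_group(record):
    """Turn one delimiter-separated record into a list of its non-empty lines."""
    return [ln.strip() for ln in record.splitlines() if ln.strip()]


def load_groups(sc, path):
    """Return an RDD whose elements are lists such as ['a1', 'a2'].

    `sc` is an existing SparkContext (assumption: created elsewhere).
    """
    records = sc.newAPIHadoopFile(
        path,
        "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text",
        conf={"textinputformat.record.delimiter": DELIMITER},
    )
    # Each element is a (byte offset, record text) pair; keep only the text,
    # split it into lines, and drop empty groups (e.g. after the last marker).
    return records.map(lambda kv: record_to_group(kv[1])).filter(lambda g: len(g) > 0)
```

Calling `load_groups(sc, "a.txt")` would then give an RDD of groups that can be processed further with the usual `map` and `filter` calls.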