I have an RDD built from logged events, and I want to take a few sample records from each event category.
The data looks like this:
|xxx|xxxx|xxxx|type1|xxxx|xxxx
|xxx|xxxx|xxxx|type2|xxxx|xxxx|xxxx|xxxx
|xxx|xxxx|xxxx|type3|xxxx|xxxx|xxxx
|xxx|xxxx|xxxx|type3|xxxx|xxxx|xxxx
|xxx|xxxx|xxxx|type4|xxxx|xxxx|xxxx|xxxx|xxxx
|xxx|xxxx|xxxx|type1|xxxx|xxxx
|xxx|xxxx|xxxx|type6|xxxx
My attempt:
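# sc is the SparkContext (available by default in the pyspark shell)
eventlist = ['type1', 'type2']  # ... more event types, elided here
orginalRDD = sc.textFile("/path/to/file/*.gz").map(lambda x: x.split("|"))
samplelist = []
for event in eventlist:
    # take(5) already returns a Python list, so no collect() is needed
    eventsample = orginalRDD.filter(lambda x: x[3] == event).take(5)
    samplelist.extend(eventsample)
print(samplelist)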
I have two questions about this:
1. Is there a better, more efficient way to collect samples that match a specific condition?
2. Is it possible to collect the original unsplit lines instead of the split fields?
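To make the goal concrete, here is a minimal plain-Python sketch (no Spark involved) of the result I am looking for: up to 5 original, unsplit lines per event type, gathered in a single pass. The helper name sample_per_type and its parameters are made up for illustration. The default type_index=3 mirrors the x[3] in my code above and assumes the lines do not start with a leading "|" (if they do, the type field would sit at index 4).

```python
from collections import defaultdict


def sample_per_type(lines, wanted_types, n=5, type_index=3):
    """Return up to n raw (unsplit) lines for each wanted event type."""
    wanted = set(wanted_types)
    samples = defaultdict(list)
    for line in lines:
        fields = line.split("|")
        # Skip lines that are too short to contain the type field.
        if len(fields) <= type_index:
            continue
        event_type = fields[type_index]
        if event_type in wanted and len(samples[event_type]) < n:
            # Keep the original line, not the split fields.
            samples[event_type].append(line)
    return dict(samples)


# Illustrative input only, shaped like the data above without a leading "|".
example_lines = [
    "xxx|xxxx|xxxx|type1|xxxx|xxxx",
    "xxx|xxxx|xxxx|type2|xxxx|xxxx|xxxx|xxxx",
    "xxx|xxxx|xxxx|type3|xxxx|xxxx|xxxx",
    "xxx|xxxx|xxxx|type1|xxxx|xxxx",
]
print(sample_per_type(example_lines, ["type1", "type2"]))
```

The split is used only to read the type field, so the lines that come back are the untouched originals. What I would like is the Spark equivalent of this that avoids one filter pass per event type.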
Python or Scala suggestions are welcome!