I have an RDD, created with PySpark, that is around 600 GB in size after a join by key. It looks exactly like this:
[('43.72_-70.08', (('0744632', -70.08, 43.72, '2.4'), '18090865')),
('43.72_-70.08', (('0744632', -70.08, 43.72, '2.4'), '18090865')),
('43.25_-67.58', (('0753877', -67.58, 43.25, '7.2'), '18050868')),
('43.01_-75.24', (('0750567', -75.24, 43.01, '7.2'), '18042872'))]
I want output like this, sorted by the first element:
['0744632', '18090865', '2.4',
'0744632', '18090865', '2.4',
'0750567', '18042872', '7.2',
'0753877', '18050868', '7.2']
Is there a way to pull the data out of the nested tuples and get the output in the required format?
Note: This is a 600 GB RDD with more than a million distinct values in the first column and approximately 15 billion rows, so I would really appreciate an optimized approach if possible.
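To make the target transformation concrete, here is a minimal pure-Python sketch run on the four sample rows above. The helper name `flatten_record` and the variable names are hypothetical, chosen only for illustration. I assume the same function could be passed to `map` on the real RDD and followed by `sortBy`, but I do not know whether that is efficient at this scale.

```python
def flatten_record(record):
    """Turn (key, ((id, lon, lat, value), code)) into (id, code, value)."""
    # Nested unpacking mirrors the shape of one joined row.
    _key, ((station_id, _lon, _lat, value), code) = record
    return (station_id, code, value)


# The four sample rows from above.
sample = [
    ('43.72_-70.08', (('0744632', -70.08, 43.72, '2.4'), '18090865')),
    ('43.72_-70.08', (('0744632', -70.08, 43.72, '2.4'), '18090865')),
    ('43.25_-67.58', (('0753877', -67.58, 43.25, '7.2'), '18050868')),
    ('43.01_-75.24', (('0750567', -75.24, 43.01, '7.2'), '18042872')),
]

# Flatten each row, then sort by the first element of the flattened tuple.
flattened = sorted((flatten_record(r) for r in sample), key=lambda t: t[0])

for row in flattened:
    print(row)

# Assumed Spark equivalent, with the joined RDD bound to a variable `rdd`:
#   result = rdd.map(flatten_record).sortBy(lambda t: t[0])
```

The sketch only pins down the row shape I am after; the open question is how to do the sort without an expensive shuffle over 15 billion rows.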