
I have an RDD which I created using PySpark; after joining by key it is around 600 GB and looks exactly like this:

[('43.72_-70.08', (('0744632', -70.08, 43.72, '2.4'), '18090865')),
 ('43.72_-70.08', (('0744632', -70.08, 43.72, '2.4'), '18090865')),
 ('43.25_-67.58', (('0753877', -67.58, 43.25, '7.2'), '18050868')),
 ('43.01_-75.24', (('0750567', -75.24, 43.01, '7.2'), '18042872'))]

I want something like this, sorted by the first element:

['0744632', '18090865', '2.4',
'0744632', '18090865', '2.4',
'0750567', '18042872', '7.2',
'0753877', '18050868', '7.2']

Is there a way I can extract the data from the tuples and get the output in the required format?

Note: this is a 600 GB RDD with more than a million distinct values in the first column and approximately 15 billion rows, so I would really appreciate an optimized approach if possible.
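
For reference, the sample rows above can be reproduced locally as a small toy RDD (a minimal sketch, assuming an existing SparkContext named sc):

sample = [('43.72_-70.08', (('0744632', -70.08, 43.72, '2.4'), '18090865')),
          ('43.72_-70.08', (('0744632', -70.08, 43.72, '2.4'), '18090865')),
          ('43.25_-67.58', (('0753877', -67.58, 43.25, '7.2'), '18050868')),
          ('43.01_-75.24', (('0750567', -75.24, 43.01, '7.2'), '18042872'))]
rdd = sc.parallelize(sample)  # same (key, (tuple, value)) structure as the joined RDD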

Sami

3 Answers


Do this in your Spark cluster, e.g.:

In []:
(rdd.map(lambda x: (x[1][0][0], x[1][1], x[1][0][3]))
 .sortBy(lambda x: x[0])
 .flatMap(lambda x: x)
 .collect())

Out[]:
['0744632', '18090865', '2.4', '0744632', '18090865', '2.4', '0750567',
 '18042872', '7.2', '0753877', '18050868', '7.2']

Alternatively:

In []:
import operator as op

(rdd.map(lambda x: (x[1][0][0], x[1][1], x[1][0][3]))
 .sortBy(lambda x: x[0])
 .reduce(op.add))

Out[]:
('0744632', '18090865', '2.4', '0744632', '18090865', '2.4', '0750567',
 '18042872', '7.2', '0753877', '18050868', '7.2')

This seems like a rather unwieldy structure; if you meant a list of tuples, then simply eliminate the flatMap():

In []:
(rdd.map(lambda x: (x[1][0][0], x[1][1], x[1][0][3]))
 .sortBy(lambda x: x[0])
 .collect())

Out[]:
[('0744632', '18090865', '2.4'),
 ('0744632', '18090865', '2.4'),
 ('0750567', '18042872', '7.2'),
 ('0753877', '18050868', '7.2')]
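
Note that collect() (and reduce) bring the entire result back to the driver, which is unlikely to work for a 600 GB RDD. A minimal sketch of writing the sorted result out from the executors instead, assuming a text file under a placeholder path is an acceptable output format:

In []:
(rdd.map(lambda x: (x[1][0][0], x[1][1], x[1][0][3]))
 .sortBy(lambda x: x[0])
 .map(lambda x: ','.join(x))            # one comma-separated line per record
 .saveAsTextFile('/path/to/output'))    # placeholder path; written in parallel, nothing is collected to the driver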
AChampion
  • I got this error ... Is it because I am running out of memory? **Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 42 tasks (1107.9 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)** – Sami Apr 27 '18 at 04:13
  • Change your max result size, e.g. `sc.getConf().set("spark.driver.maxResultSize", "2g")` – AChampion Apr 27 '18 at 04:21
  • This is not helping, I still get the same error. Can you please help me with this error. – Sami Apr 27 '18 at 15:11
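
Following up on the maxResultSize comments above: spark.driver.maxResultSize generally has to be set before the SparkContext is created, so calling set() on the conf returned by a running context has no effect. A minimal sketch, assuming a SparkSession is being built (the app name is just a placeholder):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('flatten-rdd')                       # placeholder app name
         .config('spark.driver.maxResultSize', '2g')   # must be set before the context starts
         .getOrCreate())
sc = spark.sparkContext

Even so, collecting 15 billion rows to the driver will keep hitting this limit; writing the result out with saveAsTextFile (see the sketch in the answer above) avoids the driver bottleneck entirely.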

This is a simple one-line solution:

sorted([(x[1][0][0], x[1][1], x[1][0][3]) for x in your_list]) 
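
For example, applied to the four sample rows from the question (assuming they are available on the driver as a plain Python list named your_list), this gives:

[('0744632', '18090865', '2.4'),
 ('0744632', '18090865', '2.4'),
 ('0750567', '18042872', '7.2'),
 ('0753877', '18050868', '7.2')]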

I think it's slightly faster than a lambda-based solution, based on this post: What is the difference between these two solutions - lambda or loop - Python

Kenan

Similar to the other Spark answer:

rdd = rdd.map(lambda x: [x[1][0][0], x[1][1], x[1][0][3]]) \
         .sortBy(lambda row: row[0])

You can also use reduce instead of flatMap:

rdd.reduce(lambda x,y: x+y)
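
On the sample rows this should give the flat list from the question:

['0744632', '18090865', '2.4', '0744632', '18090865', '2.4', '0750567',
 '18042872', '7.2', '0753877', '18050868', '7.2']

(For 15 billion rows, though, that concatenated result lives on the driver, so this is only practical for small data.)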
ags29