
Let's say I have a list L=[[a,2],[a,3],[a,4],[b,4],[b,8],[b,9]]. Using pyspark, I want to be able to remove the third element so that it will look like this:

[a,2]
[a,3]
[b,4]
[b,8]

I am new to pyspark and not sure what I should do here.

MATT SHALLOW
  • I do not see nested lists, I see a list of tuples. And what happened to (b,9)? It is the last element, not the third, but it still vanished... – Patrick Artner Mar 18 '18 at 18:58
  • [/how-to-remove-an-element-from-a-list-by-index-in-python](https://stackoverflow.com/questions/627435/how-to-remove-an-element-from-a-list-by-index-in-python) and [remove-an-element-from-a-python-list-of-lists-in-pyspark-dataframe](https://stackoverflow.com/questions/41624567/remove-an-element-from-a-python-list-of-lists-in-pyspark-dataframe) and [understanding-pythons-slice-notation](https://stackoverflow.com/questions/509211/understanding-pythons-slice-notation) – Patrick Artner Mar 18 '18 at 19:10
  • and [how-to-remove-multiple-indexes-from-a-list-at-the-same-time](https://stackoverflow.com/questions/11303225/how-to-remove-multiple-indexes-from-a-list-at-the-same-time) and ... some more Q all about list manipulation. – Patrick Artner Mar 18 '18 at 19:11
  • 1
    This is a Python question, not a Spark one? – Marsellus Wallace Mar 18 '18 at 21:48
  • Just to clarify, I need it to remove the third element of each group. Here each group is defined by the first element of the nested list, so letter a and letter b. Also, the actions will be performed on an RDD, which means I need to use pyspark. – MATT SHALLOW Mar 19 '18 at 08:48

1 Answer


You can try something like this.
The first step is to group by the key column and aggregate the values into a list. Then use a UDF to keep the first two values of each list, and finally explode that column back into one row per value.

from pyspark.sql.functions import collect_list, udf, explode
from pyspark.sql.types import ArrayType, IntegerType

df = sc.parallelize([('a', 2), ('a', 3), ('a', 4),
                     ('b', 4), ('b', 8), ('b', 9)]).toDF(['key', 'value'])

# UDF that keeps only the first two values of each collected list
foo = udf(lambda x: x[0:2], ArrayType(IntegerType()))

df_list = (df.groupby('key').agg(collect_list('value'))
             .withColumn('values', foo('collect_list(value)'))
             .withColumn('value', explode('values'))
             .drop('values', 'collect_list(value)'))
df_list.show()

Result:

+---+-----+
|key|value|
+---+-----+
|  b|    4|
|  b|    8|
|  a|    2|
|  a|    3|
+---+-----+
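A side note on ordering: collect_list does not guarantee the order in which the values are gathered, so which two rows survive per key may vary between runs. If a deterministic "first two by value" is needed, a window-function variant is one possibility. The sketch below is not from the answer above; it assumes rows should be ranked by value within each key, and the names w, rn and df_top2 are only illustrative:

from pyspark.sql import Window
from pyspark.sql.functions import row_number

# rank rows within each key by value, then keep the first two per key
w = Window.partitionBy('key').orderBy('value')
df_top2 = (df.withColumn('rn', row_number().over(w))
             .filter('rn <= 2')
             .drop('rn'))
df_top2.show()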
pauli
  • This worked for me. I was hoping there would be a much cleaner solution that did not require data frames, maybe something using reduceByKey and groupBy, but there does not seem to be one (an RDD-only sketch along those lines follows these comments). Thank you! – MATT SHALLOW Mar 19 '18 at 14:14
  • I have one additional question. How would this solution work with three columns? I am trying to add a third column that needs to be treated the same as "value" in the solution above, but the output is not being printed in the right order. – MATT SHALLOW Mar 19 '18 at 16:07
  • Can you elaborate the right order? did you mean ascending or descending order? – pauli Mar 19 '18 at 17:33
  • I just mean that when I add another column to your example code it does not work. I added the new value in the .agg() and also created new columns using withColumn(). When I show the list, all the column data is mixed up and it no longer shows the top five. – MATT SHALLOW Mar 19 '18 at 17:59
  • can you edit the question and add this part with its model solution? – pauli Mar 20 '18 at 03:04
  • Hi @ashwinids, I actually managed to get it working this morning. Instead of making a third column, I just combined the third piece of data with the second. Once again, I appreciate the help! – MATT SHALLOW Mar 21 '18 at 11:29
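On the reduceByKey/groupBy point raised in the comments, a pure-RDD version is also possible. The sketch below is only a possibility, not the answer's method; it assumes the data starts as a (key, value) pair RDD and that values should be sorted within each key before keeping the first two (rdd and trimmed are illustrative names):

rdd = sc.parallelize([('a', 2), ('a', 3), ('a', 4),
                      ('b', 4), ('b', 8), ('b', 9)])

# group values per key, sort them (groupByKey gives no order guarantee),
# keep the first two, and flatten back to (key, value) pairs
trimmed = (rdd.groupByKey()
              .flatMap(lambda kv: [(kv[0], v) for v in sorted(kv[1])[:2]]))

print(trimmed.collect())   # e.g. [('a', 2), ('a', 3), ('b', 4), ('b', 8)]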