Using PySpark.
Follow-up: I think I only need to know how to select the n
elements after a given element in a list and join them with the list itself.
For example, given the list 'a','b','c','d','e','f','g':
+-------+-----+
| _index| item|
+-------+-----+
| 0 | a |
| 1 | b |
| 2 | c |
| 3 | d |
| 4 | e |
| 5 | f |
| 6 | g |
+-------+-----+
with indices 0 to 6. If we want to join, say, the n=3
elements after 'c' with the list itself, we get:
+--------+-------+-------+
| _index | item1 | item2 |
+--------+-------+-------+
| 3 | d | d |
| 4 | e | e |
| 5 | f | f |
+--------+-------+-------+
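Before reaching for Spark, the operation the two tables illustrate is just a slice in plain Python (the function name `elements_after` is my own, for illustration):

```python
items = ['a', 'b', 'c', 'd', 'e', 'f', 'g']

def elements_after(items, target, n):
    """Return (_index, item1, item2) rows for the n elements after `target`,
    each element paired with itself, mirroring the table above."""
    start = items.index(target) + 1
    return [(i, items[i], items[i]) for i in range(start, min(start + n, len(items)))]

rows = elements_after(items, 'c', 3)
# rows == [(3, 'd', 'd'), (4, 'e', 'e'), (5, 'f', 'f')]
```

The Spark version of this is a self-join on the index range, as in the code below.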
The following is a piece of related code. Is it possible to modify it to pick the elements after A
within a distance n
and join them with the list that contains A
? I am new to Spark and would appreciate some help. Thanks!
Suppose we have many lists. We first find an element in these lists that satisfies some condition condition1
and give it the alias A
.
If we then pick another element after A
's index (within a certain index distance, say 1 to 3
) and join it with the list that contains A
, we can do the following:
from pyspark.sql.functions import col

df.where(
    col('condition1') == 0  # finds an element satisfying some condition; name it 'A'
).alias('A').join(
    df.alias('B'),
    # join 'A' with every element of the same list whose index is
    # 1 to 3 positions after 'A'. Note: random.randint(1, 3) would be
    # evaluated once in Python while the plan is built, not once per row,
    # so it cannot express a per-row random pick here.
    (col('A.ListId') == col('B.ListId'))
    & (col('B._index') - col('A._index')).between(1, 3)
)
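The pitfall with `random.randint` is easy to see without Spark: it is an ordinary Python call, so it runs once while the join condition is being constructed, and the single value it returns is baked into the plan as a constant. A minimal pure-Python sketch of the same join logic (the `list_rows` data and `a_index` are made-up stand-ins for the DataFrame and the row matching condition1):

```python
import random

list_rows = [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd'), (4, 'e'), (5, 'f'), (6, 'g')]
a_index = 2  # the index of 'A' (here, 'c')

# Pitfall: randint runs once, so the "random" offset is a constant
# applied to every row, and the join always matches the same single row.
offset = random.randint(1, 3)
picked_once = [(i, item) for i, item in list_rows if i == a_index + offset]
# always exactly one fixed row, never a fresh draw per row

# Range form: keep every element within index distance 1 to 3 after 'A',
# which is what the between(1, 3) condition expresses in Spark.
within_distance = [(i, item) for i, item in list_rows if 1 <= i - a_index <= 3]
# within_distance == [(3, 'd'), (4, 'e'), (5, 'f')]
```

If a genuinely random per-row offset is needed, it has to come from a Spark expression rather than a Python call, and Spark restricts where non-deterministic expressions may appear; selecting the full range and sampling afterwards is the simpler route.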