
Using PySpark.

Follow-up: I think I only need to know how to select the n elements after a given element in a list, and join them with the list itself.

For example, say you have the list 'a','b','c','d','e','f','g':

+-------+-----+
| _index| item|
+-------+-----+
|   0   |   a |
|   1   |   b |
|   2   |   c |
|   3   |   d |
|   4   |   e |
|   5   |   f |
|   6   |   g |
+-------+-----+

with indices 0 to 6. We want to join, say, the n=3 elements after 'c' with the list itself, and we get

+--------+-------+-------+
| _index | item1 | item2 |
+--------+-------+-------+
|   3    |   d   |   d   |
|   4    |   e   |   e   |
|   5    |   f   |   f   |
+--------+-------+-------+
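For reference, a minimal sketch of what this toy example could look like in PySpark (assuming an active SparkSession named spark; the start element 'c' and n=3 are hard-coded here):

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(0, "a"), (1, "b"), (2, "c"), (3, "d"), (4, "e"), (5, "f"), (6, "g")],
    schema=["_index", "item"],
)

n = 3
# index of the start element, as a plain integer
start = df.filter(df.item == "c").select("_index").first()[0]

# the n elements after 'c'
after = df.filter((df._index > start) & (df._index <= start + n))

# join the slice back onto the list itself by _index
(after.alias("x")
      .join(df.alias("y"), F.col("x._index") == F.col("y._index"))
      .select("x._index",
              F.col("x.item").alias("item1"),
              F.col("y.item").alias("item2"))
      .show())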

The following is one piece of related code. Is it possible to modify this code to pick the elements after A within a distance n and join them with the list that contains A? I am new to Spark and would appreciate some help. Thanks!


Suppose we have many lists. We first find an element in these lists that satisfies some condition, condition1, and give it the alias A.

If we randomly pick another element after A's index (within a certain index distance, say 1 to 3) and then join it with the list that contains A, we can do the following:

import random
from pyspark.sql.functions import col

# pick one random index distance between 1 and 3 (inclusive),
# evaluated once on the driver rather than per row
offset = random.randint(1, 3)

df.where(
    col('condition1') == 0  # finds an element satisfying some condition, name it 'A'
).alias('A').join(
    df.alias('B'),
    # pick the element after 'A' at that index distance
    # and join it with the list that contains 'A'
    (col('A.ListId') == col('B.ListId'))
    & (col('A._index') + offset == col('B._index'))
)
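Note that random.randint is evaluated once, on the driver, when the query is built, so every row pair is compared with the same offset. If a different random offset per row is wanted, one possible workaround (my assumption, not part of the original code) is to materialize the offset as a column with pyspark.sql.functions.rand before joining, since Spark does not allow nondeterministic expressions directly in a join condition:

from pyspark.sql.functions import col, floor, rand

# per-row random offset, uniform over {1, 2, 3}
a = (
    df.where(col('condition1') == 0)
      .withColumn('offset', floor(rand() * 3) + 1)
      .alias('A')
)

a.join(
    df.alias('B'),
    (col('A.ListId') == col('B.ListId'))
    & (col('A._index') + col('A.offset') == col('B._index'))
)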
Tony
  • This is all very abstract - it would be helpful if you could provide a [reproducible example](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples) with a small, representative input DataFrame and the desired output. – pault Feb 25 '19 at 15:57
  • I don't think there's a way to do this without a Cartesian product. – pault Feb 25 '19 at 18:47

1 Answer


Here is a sample of a possible workaround that you could apply:

import random

l = [(0, "a"), (1, "b"), (2, "c"), (3, "d"), (4, "e"), (5, "f"), (6, "g")]
df = spark.createDataFrame(l, schema=["_index", "item"])

# get A's index out of the Row as a plain integer
start = df.filter(df.item == "c").select("_index").first()[0]

# keep the elements after A within a random distance of 1 to 3
# (random.randint is inclusive on both ends)
df.filter((df._index > start) & (df._index <= start + random.randint(1, 3))).show()

So I think the only missing part, apart from your join, was getting the integer value of A's index out of the Row.
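Combining this with the join from the question, an end-to-end sketch for the toy list might look as follows (hedged: the random distance is still drawn once on the driver, and for many lists you would also match the question's ListId column in the join condition):

import random
from pyspark.sql.functions import col

n = random.randint(1, 3)  # random distance between 1 and 3, inclusive

# elements after A within distance n, joined back with the list itself
(df.filter((df._index > start) & (df._index <= start + n))
   .alias("B")
   .join(df.alias("A"), col("B._index") == col("A._index"))
   .select(col("B._index").alias("_index"),
           col("B.item").alias("item1"),
           col("A.item").alias("item2"))
   .show())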

aysegulpekel