
Let's say I have a PySpark data frame like so:

1 0 1 0
0 0 1 1
0 1 0 1

How can I append the previous and next rows' values to the current row, like so:

1 0 1 0 0 0 0 0 0 0 1 1
0 0 1 1 1 0 1 0 0 1 0 1
0 1 0 1 0 0 1 1 0 0 0 0

I'm familiar with the .withColumn() method for adding columns, but I'm not sure what expression I would use here.

The "0 0 0 0" values are placeholders, because the first row has no prior row and the last row has no subsequent row.

pault
Chris C
    Can you make a more realistic example? Generally it's not a good idea to have `a`, `1` and `!` in one column, Same goes for other columns. With that said, `withColumn`, `lead` and `lag` should do what you need. – Psidom Jul 10 '18 at 15:06
  • I'm sure you can imagine more realistic placeholders in my example. I just put them for ease of distinction. – Chris C Jul 10 '18 at 15:07
  • Your example is confusing. Are `a, b, c, d` the column names? Where does `0 0 0 0` come from? See [how to create good reproducible apache spark dataframe examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-dataframe-examples) @Chris. – pault Jul 10 '18 at 15:11
  • @pault Okay I'll change the example. The "0 0 0 0" are placeholder values because there are no prior or subsequent rows before and after those rows. – Chris C Jul 10 '18 at 15:11

1 Answer


You can use pyspark.sql.functions.lead() and pyspark.sql.functions.lag(), but first you need a way to order your rows. If you don't already have a column that determines the order, you can create one using pyspark.sql.functions.monotonically_increasing_id(). (The generated ids are guaranteed to be increasing and unique, but not consecutive, which is fine for ordering.)

Then use this in conjunction with a Window function.

For example, if you had the following DataFrame df:

df.show()
#+---+---+---+---+
#|  a|  b|  c|  d|
#+---+---+---+---+
#|  1|  0|  1|  0|
#|  0|  0|  1|  1|
#|  0|  1|  0|  1|
#+---+---+---+---+

You could do:

from pyspark.sql import Window
import pyspark.sql.functions as f

cols = df.columns

# Add an ordering column so lag/lead have a well-defined row order
df = df.withColumn("id", f.monotonically_increasing_id())
w = Window.orderBy("id")

df.select(
    "*",
    *([f.lag(f.col(c), default=0).over(w).alias("prev_" + c) for c in cols] +
      [f.lead(f.col(c), default=0).over(w).alias("next_" + c) for c in cols])
).drop("id").show()
#+---+---+---+---+------+------+------+------+------+------+------+------+
#|  a|  b|  c|  d|prev_a|prev_b|prev_c|prev_d|next_a|next_b|next_c|next_d|
#+---+---+---+---+------+------+------+------+------+------+------+------+
#|  1|  0|  1|  0|     0|     0|     0|     0|     0|     0|     1|     1|
#|  0|  0|  1|  1|     1|     0|     1|     0|     0|     1|     0|     1|
#|  0|  1|  0|  1|     0|     0|     1|     1|     0|     0|     0|     0|
#+---+---+---+---+------+------+------+------+------+------+------+------+
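Conceptually, lag fetches the previous row's value and lead fetches the next row's, substituting the default (here 0) at the boundaries. A plain-Python sketch of that per-row logic (not the PySpark API, just an illustration of what the window functions compute):

```python
# Hypothetical helper illustrating lag/lead semantics on a list of row tuples.
rows = [(1, 0, 1, 0), (0, 0, 1, 1), (0, 1, 0, 1)]

def with_prev_next(rows, default=0):
    out = []
    n = len(rows)
    for i, row in enumerate(rows):
        # lag: previous row, or a row of defaults at the top boundary
        prev = rows[i - 1] if i > 0 else (default,) * len(row)
        # lead: next row, or a row of defaults at the bottom boundary
        nxt = rows[i + 1] if i < n - 1 else (default,) * len(row)
        out.append(row + prev + nxt)
    return out

for r in with_prev_next(rows):
    print(r)
```

This reproduces the same current/prev/next values as the PySpark output above, e.g. the first row becomes `(1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1)`.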
Manu CJ
pault