
My df has multiple columns

Query I tried:

df = df.withColumn('Column_required', F.when(df.Column_present > 1, df.Column_present).otherwise(F.lag(df.Column_present)))

I can't get the otherwise part to work. This is the column I want to operate on, with the required result:

Column_present       Column_required
40000                 40000
Null                  40000
Null                  40000
500                   500
Null                  500
Null                  500
Sandy
  • Please try to make your question more readable by using the formatting options, it's kind of hard to understand what exactly you are trying to achieve... Edit: I see you did that while I typed my comment :) – Thomas Oct 11 '18 at 12:08
  • These might help [(1) fill with last observation](https://stackoverflow.com/questions/36019847/pyspark-forward-fill-with-last-observation-for-a-dataframe) and [forward fill missing values](https://stackoverflow.com/questions/38131982/forward-fill-missing-values-in-spark-python/50422240#50422240) – Merelda Oct 11 '18 at 12:46

1 Answer


I think the solution might be to use last instead of lag:

from pyspark.sql import Window
from pyspark.sql.functions import when, last

df_new = spark.createDataFrame([
    (1, 40000), (2, None), (3, None), (4, None),
    (5, 500), (6, None), (7, None)
], ("id", "Col_present"))

# Forward-fill: keep existing values, otherwise take the last non-null value so far
df_new.withColumn(
    'Column_required',
    when(df_new.Col_present > 1, df_new.Col_present)
    .otherwise(last(df_new.Col_present, ignorenulls=True).over(Window.orderBy("id")))
).show()

This will produce your desired output:

+---+-----------+---------------+
| id|Col_present|Column_required|
+---+-----------+---------------+
|  1|      40000|          40000|
|  2|       null|          40000|
|  3|       null|          40000|
|  4|       null|          40000|
|  5|        500|            500|
|  6|       null|            500|
|  7|       null|            500|
+---+-----------+---------------+

But be aware that the window function requires a column to perform the sorting; that's why I used the id column in the example. If your dataframe does not contain a sortable column, you can create an id column yourself with monotonically_increasing_id().
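
For completeness, a minimal sketch of adding such an id column; the input frame here is hypothetical, not from the original post:

from pyspark.sql.functions import monotonically_increasing_id

# Hypothetical frame with no sortable column
df_no_id = spark.createDataFrame(
    [(40000,), (None,), (None,), (500,), (None,), (None,)],
    "Col_present INT"
)

# Ids are increasing but not necessarily consecutive;
# they only reflect the current row order and partitioning
df_with_id = df_no_id.withColumn("id", monotonically_increasing_id())
df_with_id.show()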

gaw
  • Thanks, it worked!! Could you please explain what .over(Window) does? – Sandy Oct 11 '18 at 13:53
  • The lag function requires a kind of list of values from which to choose the previous value. The `window` can be used to create this list; in this case it just takes the data, performs the ordering, and produces this set. So usually you should not apply a `lag` function without a `window`, since it does not know where to take the previous value from... I hope it gets a bit clearer now – gaw Oct 11 '18 at 13:57
  • Hello gaw, your solution worked well on Jupyter, but when I batch submitted it on the terminal it gave me an error (that ignorenulls is an unexpected keyword). Could you please tell me how I could also run it on batch submit? Thanks for the help – Sandy Oct 12 '18 at 13:09
  • If you want to run it on batch, you have to make sure that you import `sql.functions` so that it knows the function. If you then use the same Spark version as in your Jupyter it should work fine. Unfortunately, the solution relies on ignorenulls, since you want to consider the numeric values only. You can also try to run the code without specifying the keyword and just pass `True` as the second parameter, as sketched below: `...last(df_new.Col_present,True).over...` – gaw Oct 15 '18 at 06:07
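
Following up on that last comment, a minimal sketch of the positional variant, reusing df_new from the answer above:

from pyspark.sql import Window
from pyspark.sql.functions import when, last

# Same logic as the answer, but ignorenulls passed positionally as True
# (for PySpark versions that reject the keyword form)
df_new.withColumn(
    'Column_required',
    when(df_new.Col_present > 1, df_new.Col_present)
    .otherwise(last(df_new.Col_present, True).over(Window.orderBy("id")))
).show()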