Given the dataset
foo | bar |
---|---|
1 | 2 |
null | null |
3 | 4 |
How can I replace `foo` and `bar` at once with something else, like `(5, 6)`, when one or both of them are null? Desired result:
foo | bar |
---|---|
1 | 2 |
5 | 6 |
3 | 4 |
This applies to geo datasets, where lat/lng may be unknown and must be obtained from somewhere else. The udf is therefore time-consuming, and I'd like to be sure it's called only for the necessary rows (where both `foo` and `bar` are null).
The following code

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, udf

spark = (SparkSession.builder.master("local")
         .appName("SimpleApp")
         .getOrCreate())

def my_udf():
    return 0, 0

df = spark.createDataFrame([[1, 2], [None, None], [3, 4]], schema=['foo', 'bar'])
df = df.withColumn("result", when(df['foo'].isNull() | df['bar'].isNull(), udf(my_udf)()))
df.show()
```
is a bad attempt:
foo | bar | result |
---|---|---|
1 | 2 | null |
null | null | [Ljava.lang.Objec... |
3 | 4 | null |
So it's necessary to unpack the array somehow into columns.
Considering *Apache Spark -- Assign the result of UDF to multiple dataframe columns*, it cannot be done in one step.
But even if I return a struct and unpack it, how do I leave the unaffected columns alone?
The other approach I have tried (considering that I need further processing of `foo` and `bar`):
```python
from pyspark.sql.functions import array, udf

# continues from the session created above

def some(baz):
    return 'some'

def something_else(foo, bar):
    return 'something'

def my_udf(row):
    foo, bar, baz = row
    return some(baz) if foo is None and bar is None else something_else(foo, bar)

df = spark.createDataFrame([[1, 2, 3], [None, None, 4], [3, 4, 5]], schema=['foo', 'bar', 'baz'])
df = df.withColumn("result", udf(my_udf)(array('foo', 'bar', 'baz')))
df.show()
```
But I feel this is not optimal, because even if we don't need `baz` for most rows, we still pass it to the udf, which I think will prevent query optimization. Of course I could apply different udfs to different columns one by one, but that also seems suboptimal.
So, is there any way to replace the values in both columns at once?