I have the code shown below. I need to check whether the column y.lc.eoouch.ci is present in the input source and populate the column only if it is; otherwise it should be NULL. (The key lc is also optional.) The code below doesn't work as intended: even though y.lc.eoouch.ci is present in the input, it evaluates to NULL.

The has_column implementation is from here.

df = (
    df_s_a
    .withColumn(
        "ceci",
        udf(
            lambda y: y.lc[-1].eoouh.ci
            if has_column(y, 'lc.eoouh.ci')
            else None,
            StringType(),
        )(col('eh')),
    )
    .select(col('ceci'))
)
df.show()

Sample input:

{
  eh: {
    lc: [
      {
        eoouch: {
          ci: "1234ABC"
        }
      }
    ]
  }
}
JohnWick

1 Answer


Accessing the nested field as df[something.path.somewhere] doesn't work here. I still need to investigate that option a bit.

I've managed to make it work like this:

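from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import StringType


def has_column(value):
    # `value` is the Python object the UDF receives for one row of "eh"
    # (a dict here, because the column is inferred as a map).
    try:
        value["lc"][0]["eoouch"]["ci"]
        return True
    except (KeyError, IndexError, TypeError):
        # Missing key, empty "lc" list, or a null somewhere along the path.
        return False


if __name__ == "__main__":

    spark = SparkSession.builder.getOrCreate()
    data = [
        {"eh": {"lc": [{"eoouch": {"ci": "test"}}]}},
        {"eh": {"lc": [{"eoouch": {"as": "test"}}]}},
    ]

    df = spark.createDataFrame(data)
    # Keep the value of "eh" when the nested key exists, otherwise return null.
    add_column_udf = F.udf(
        lambda y: y if has_column(y) else None,
        StringType(),
    )
    df = df.withColumn("ceci", add_column_udf(F.col("eh")))
    df.show(truncate=False)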

Result:

+----------------------------------+-------------------------+                  
|eh                                |ceci                     |
+----------------------------------+-------------------------+
|{lc -> [{eoouch -> {ci -> test}}]}|{lc=[{eoouch={ci=test}}]}|
|{lc -> [{eoouch -> {as -> test}}]}|null                     |
+----------------------------------+-------------------------+

It's not perfect, since the column path is hard-coded rather than passed in, but it could easily be generalized because the UDF operates on a plain dict object.
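One way that generalization might look is sketched below in plain Python, without Spark, so the lookup logic can be checked on its own. The helper name get_nested and its path argument are hypothetical and not part of the code above. It also assumes the intent is to return the nested value itself (as the question asks) rather than the whole "eh" object.

```python
from typing import Any, Sequence, Union

PathStep = Union[str, int]


def get_nested(obj: Any, path: Sequence[PathStep], default: Any = None) -> Any:
    """Follow `path` through nested dicts/lists; return `default` if a step is missing."""
    current = obj
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError, ValueError):
            # KeyError: missing dict key; IndexError: list too short;
            # TypeError: current is None or not indexable by this step.
            # ValueError is caught on the assumption that the object may be
            # a pyspark Row, which signals a missing field that way.
            return default
    return current


# Same shape as the "eh" values above, plus a row without the optional "lc" key.
rows = [
    {"lc": [{"eoouch": {"ci": "test"}}]},
    {"lc": [{"eoouch": {"as": "test"}}]},
    {},
]
path = ["lc", -1, "eoouch", "ci"]
values = [get_nested(row, path) for row in rows]
print(values)  # ['test', None, None]
```

Inside the Spark job this could presumably be wrapped as F.udf(lambda y: get_nested(y, ["lc", -1, "eoouch", "ci"]), StringType()), so that "ceci" holds the ci value or null.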

vladsiv