
Consider the following schema:

root
 |-- A: string (nullable = true)
 |-- B: string (nullable = true)
 |-- C: string (nullable = true)
 |-- D: struct (nullable = true)
 |    |-- d1: struct (nullable = true)
 |    |    |-- timestamp: string (nullable = true)
 |    |    |-- timeZoneType: string (nullable = true)
 |    |    |-- zoneName: string (nullable = true)
 |    |-- d2: string (nullable = true)
 |    |-- d3: string (nullable = true)
 |-- E: array (nullable = true)
 |    |-- e1: struct (nullable = true)
 |    |    |-- transactionId: string (nullable = true)
 |    |    |-- timeStamp: string (nullable = true)
 |    |    |-- instanceId: string (nullable = true)
 |    |    |-- userId: string (nullable = true)
 |    |    |-- reason: string (nullable = true)
 |    |-- e2: array (nullable = true)
 |    |    |-- transactionId: string (nullable = true)
 |    |    |-- timeStamp: string (nullable = true)
 |    |    |-- instanceId: string (nullable = true)
 |    |    |-- userId: string (nullable = true)
 |    |    |-- reason: string (nullable = true)
 |    |    |-- additionalData: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)

How do I remove a set of column values from a DataFrame in PySpark without dropping them from the schema? This is different from dropping specific columns from the schema entirely.

Assume the columns to keep are in the list keepColumns. I would like to replace the entries of all other columns with NULL while leaving the entries of keepColumns intact.

E.g.,

keepColumns = ["C",
               "D.d1.zoneName",
               "E.e1.reason",
               "E.e2.timeStamp"]

Note the nested Array and Struct fields. I couldn't even use select on a sub-field of an ArrayType unless I used an index, e.g. select E.e2[0].timeStamp from table1 (after applying df.createOrReplaceTempView("table1")).
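
For what it's worth, on Spark 2.4+ the higher-order function transform can project a sub-field across every element of an array without hard-coding an index. A minimal sketch against the schema above, assuming each element of E is a struct with fields e1 and e2 as the printout suggests (a nested array such as E.e2 would need a nested transform):

    # Sketch only: requires Spark 2.4+; maps over every element of E instead of indexing E.e2[0]
    df.selectExpr("transform(E, x -> x.e1.reason) AS e1_reasons")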

Following the most-voted solution given in this post does not work either; it simply leaves the existing values unchanged.
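
For plain top-level columns, what I'm after could be sketched roughly as follows (a sketch only; it leaves the nested Array/Struct fields untouched, which is the part I'm stuck on):

    from pyspark.sql import functions as F

    # Keep the listed columns as-is; replace every other top-level column with a
    # typed NULL so the column (and its data type) stays in the schema.
    keep = set(keepColumns)
    df_nulled = df.select([
        F.col(f.name) if f.name in keep
        else F.lit(None).cast(f.dataType).alias(f.name)
        for f in df.schema.fields
    ])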

Ébe Isaac

1 Answer


I had the same issue with nested struct fields: I wanted them to be of StringType but filled with nulls, and at first I couldn't get Spark to preserve the type without going through an empty string.

Here is what worked for me, using a UDF against an empty string so Spark still infers StringType (modify your UDF slightly):

    from pyspark.sql import functions as F

    def nullify(col):
        # Turn empty strings back into NULL while keeping the column's StringType
        return F.when(col == '', F.lit(None)).otherwise(col)


    # Does not work
    >>> df.select(F.struct(F.lit(None).alias('test'))).printSchema()
    root
     |-- named_struct(test, NULL AS `test`): struct (nullable = false)
     |    |-- test: null (nullable = true)

    # Works!
    >>> df.select(F.struct(nullify(F.lit('')).alias('test'))).printSchema()
    root
     |-- named_struct(test, nullify() AS `test`): struct (nullable = false)
     |    |-- test: string (nullable = true)

Just note that I create my structs on the fly, so I apply this as I'm building them. It's a different story if you've already read in a struct: you'd have to flatten it and rebuild it in that case, roughly as sketched below.
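
If it helps, rebuilding an already-read struct might look roughly like this (a sketch only, using the D struct from the question; casting an explicit NULL to string is another way to keep the StringType):

    # Sketch: rebuild D so only D.d1.zoneName survives; every other sub-field is
    # replaced with a typed NULL, so the schema keeps the same field names and types.
    rebuilt_d = F.struct(
        F.struct(
            F.lit(None).cast('string').alias('timestamp'),
            F.lit(None).cast('string').alias('timeZoneType'),
            F.col('D.d1.zoneName').alias('zoneName'),
        ).alias('d1'),
        F.lit(None).cast('string').alias('d2'),
        F.lit(None).cast('string').alias('d3'),
    )
    df2 = df.withColumn('D', rebuilt_d)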

jastang