Consider the following schema:
root
|-- A: string (nullable = true)
|-- B: string (nullable = true)
|-- C: string (nullable = true)
|-- D: struct (nullable = true)
| |-- d1: struct (nullable = true)
| | |-- timestamp: string (nullable = true)
| | |-- timeZoneType: string (nullable = true)
| | |-- zoneName: string (nullable = true)
| |-- d2: string (nullable = true)
| |-- d3: string (nullable = true)
|-- E: array (nullable = true)
| |-- e1: struct (nullable = true)
| | |-- transactionId: string (nullable = true)
| | |-- timeStamp: string (nullable = true)
| | |-- instanceId: string (nullable = true)
| | |-- userId: string (nullable = true)
| | |-- reason: string (nullable = true)
| |-- e2: array (nullable = true)
| | |-- transactionId: string (nullable = true)
| | |-- timeStamp: string (nullable = true)
| | |-- instanceId: string (nullable = true)
| | |-- userId: string (nullable = true)
| | |-- reason: string (nullable = true)
| | |-- additionalData: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
How do I remove a set of column values from a DataFrame in PySpark without dropping the columns from the schema? This is different from dropping specific columns from the entire schema.
Assume the columns to keep are given in the list keepColumns. I would like to replace the entries of all other columns with NULL while leaving the entries of keepColumns intact.
E.g.,
keepColumns = ["C", "D.d1.zoneName", "E.e1.reason", "E.e2.timeStamp"]
Note the nested Array and Struct fields. I couldn't even use select on a sub-field of an ArrayType unless I used an index, e.g. select E.e2[0].timeStamp from table1 (after applying df.createOrReplaceTempView("table1")).
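
For reference, this is the only form of select I could get to work on a sub-field of the array (assuming a SparkSession named spark):

df.createOrReplaceTempView("table1")
spark.sql("SELECT E.e2[0].timeStamp FROM table1").show()
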
Following the most-voted solution given in this post does not work either: it simply leaves the existing values unchanged.
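
I suspect something along the lines of Column.withField (Spark 3.1+) combined with transform for the array levels might be the direction, but I don't see how to drive it from an arbitrary keepColumns list. A rough sketch of the kind of per-field rewrite I mean (field names taken from my schema):

from pyspark.sql import functions as F

# Null out D.d2 in place; the schema of D is unchanged (Spark 3.1+ only).
df2 = df.withColumn("D", F.col("D").withField("d2", F.lit(None).cast("string")))

# For a field nested inside the array E, transform() rebuilds each element;
# here e1.timeStamp is nulled while every other field is left intact.
df3 = df2.withColumn(
    "E",
    F.transform(
        "E",
        lambda el: el.withField("e1", el["e1"].withField("timeStamp", F.lit(None).cast("string"))),
    ),
)

Doing this by hand for every field not in keepColumns is what I am trying to avoid.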