I have figured out how to do what I wanted. The idea is to create the schema for the nested column (a struct of structs) as follows:
from pyspark.sql.functions import lit, udf
from pyspark.sql.types import StringType, StructField, StructType
schema = StructType([
    StructField('level2a',
        StructType([
            StructField('fielda', StringType(), nullable=False),
            StructField('fieldb', StringType(), nullable=False),
            StructField('fieldc', StringType(), nullable=False),
            StructField('fieldd', StringType(), nullable=False),
            StructField('fielde', StringType(), nullable=False),
            StructField('fieldf', StringType(), nullable=False)
        ])
    ),
    StructField('level2b',
        StructType([
            StructField('fielda', StringType(), nullable=False),
            StructField('fieldb', StringType(), nullable=False),
            StructField('fieldc', StringType(), nullable=False)
        ])
    )
])
This schema can then be used in conjunction with a UDF (which takes the above schema as its return type) to get the desired result:
def make_meta(fielda, fieldb, fieldc, fieldd, fielde, fieldf,
              fieldalvl2, fieldblvl2, fieldclvl2):
    return [
        [fielda, fieldb, fieldc, fieldd, fielde, fieldf],
        [fieldalvl2, fieldblvl2, fieldclvl2]
    ]
test_udf = udf(make_meta, schema)
df = spark.range(0, 5)
df.withColumn("test", test_udf(lit("a"), lit("b"), lit("c"), lit("d"), lit("e"),
                               lit("f"), lit("a"), lit("b"), lit("c"))).printSchema()
This prints the following:
root
|-- id: long (nullable = false)
|-- test: struct (nullable = true)
| |-- level2a: struct (nullable = true)
| | |-- fielda: string (nullable = false)
| | |-- fieldb: string (nullable = false)
| | |-- fieldc: string (nullable = false)
| | |-- fieldd: string (nullable = false)
| | |-- fielde: string (nullable = false)
| | |-- fieldf: string (nullable = false)
| |-- level2b: struct (nullable = true)
| | |-- fielda: string (nullable = false)
| | |-- fieldb: string (nullable = false)
| | |-- fieldc: string (nullable = false)
In Scala it is possible to return an instance of a case class from a UDF, which is what I was trying to do in Python (i.e. return an object).