I'm trying to add a field to a Row() in PySpark by first calling row.asDict() so I can add more fields (since Row is a tuple and immutable) and then rebuilding the Row. I keep the original schema so I can extend it with the new fields. A minimal example is below; I removed the field add/remove code because the error occurs even without any field manipulation:
row_list = dataframe.collect()
row_result = []
for row in row_list:
    # row example: Row(b="hello", a=datetime.datetime(...), c=1289)
    row = Row(**row.asDict())
    # resulting row: Row(a=datetime.datetime(...), b="hello", c=1289)
    row_result.append(row)
spark.createDataFrame(row_result, schema=dataframe.schema)
# this fails!
# Spark tries to convert 'hello' to TimestampType
Basically, the asDict()/Row(**dict) round trip is the problem: Row(**kwargs) sorts its fields alphabetically by name, so the rebuilt Row ends up with a different field order than the original (on Python 3.6 the dict itself preserves insertion order, so the reordering happens in the Row constructor). Even though the schema defines fields by name, createDataFrame assigns types by position, so if my old field #3 was a TimestampType and the reordered row has a StringType in that position, it fails. I'm on Spark 2.1 and Python 3.6. Thanks
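To make the failure mode concrete without a Spark session, here is a pure-Python sketch using collections.namedtuple as a stand-in for pyspark.sql.Row (the make_sorted_row helper is hypothetical; it only mimics the alphabetical sorting that Row(**kwargs) does). It also shows one possible fix: rebuilding the row positionally in the original schema's field order instead of via keyword arguments.

```python
from collections import namedtuple

# Hypothetical stand-in for Row(**kwargs): like PySpark's Row constructor,
# it sorts the field names alphabetically before building the tuple.
def make_sorted_row(**kwargs):
    fields = sorted(kwargs)
    return namedtuple("Row", fields)(*(kwargs[f] for f in fields))

original_fields = ("b", "a", "c")  # original schema order, as in the example above
Row_ = namedtuple("Row", original_fields)
row = Row_("hello", "2017-01-01 00:00:00", 1289)  # placeholder timestamp string

d = row._asdict()              # analogous to row.asDict(); preserves field order
resorted = make_sorted_row(**d)
# resorted._fields is now ("a", "b", "c"): the string "hello" moved to
# position 2, where the schema expects the timestamp column.

# Possible workaround: rebuild positionally using the original field order,
# so each value stays aligned with its column in the schema.
rebuilt = Row_(*(d[f] for f in original_fields))
assert rebuilt == row
```

In real PySpark code the same positional idea would mean building the Row from the schema's field names (e.g. `dataframe.schema.fieldNames()`) rather than from `Row(**dict)`.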