
I'm trying to add a field to a Row() in PySpark by first calling Row().asDict() so I can add more fields (as Row is a tuple and immutable). I save the schema so I can add the new fields to it. The base code example is something like this. I removed the field add/remove code because the error occurs even without any field manipulation:

row_list = dataframe.collect()
row_result = []

for row in row_list:
  # row example: Row(b="hello", a=datetime.datetime(...), c=1289)
  row = Row(**row.asDict())
  # resulting row: Row(a=datetime.datetime(...), b="hello", c=1289)
  row_result.append(row)


spark.createDataFrame(row_result, schema=dataframe.schema)
# this fails!
# spark tries to convert 'hello' to TimestampType
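For context on why the positions shift: in Spark 2.x, `Row(**kwargs)` sorts the keyword fields alphabetically, while `createDataFrame` matches each row's values to the schema by position. A minimal, Spark-free sketch of that positional mismatch (the `sorted_row` helper is hypothetical, standing in for `Row(**kwargs)`):

```python
import datetime

# Hypothetical stand-in for Spark 2.x Row(**kwargs), which sorts
# the keyword fields alphabetically before storing them as a tuple.
def sorted_row(**kwargs):
    keys = sorted(kwargs)
    return keys, tuple(kwargs[k] for k in keys)

# Declared field order in the original schema: b, a, c
schema_fields = ["b", "a", "c"]

keys, values = sorted_row(b="hello", a=datetime.datetime(2019, 2, 8), c=1289)
print(keys)  # ['a', 'b', 'c']

# After the round-trip, position 0 holds the datetime where the schema
# expects the StringType field "b", and position 1 holds "hello" where
# the schema expects TimestampType -- the reported failure.
```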

Basically, asDict() is the problematic call here. It takes a Row and returns a dictionary, and when I generate a new Row(**dict) from it, the fields come back in alphabetical order. Even though the schema is defined by field names, Spark assigns types by position, so if my old field #3 was a TimestampType and the reordered row has a StringType in that position, it fails. I'm on Spark 2.1 and Python 3.6. Thanks
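One workaround sketch (not tested against Spark 2.1, and `to_schema_order` is a hypothetical helper): skip the `Row(**dict)` round-trip entirely and rebuild each row as a plain tuple in the schema's declared field order, so positional type assignment lines up again:

```python
# Sketch of a workaround: feed createDataFrame plain tuples built in
# schema order instead of alphabetically-sorted Rows.
def to_schema_order(row_dict, field_names):
    # field_names would be dataframe.schema.fieldNames() in the real code
    return tuple(row_dict[name] for name in field_names)

row_dict = {"a": "2019-02-08 00:00:00", "b": "hello", "c": 1289}
field_names = ["b", "a", "c"]
print(to_schema_order(row_dict, field_names))
# ('hello', '2019-02-08 00:00:00', 1289)
```

With real Spark objects this would look like `spark.createDataFrame([to_schema_order(r.asDict(), dataframe.schema.fieldNames()) for r in row_list], schema=dataframe.schema)`, after adding the new fields to both the dict and the schema.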

midnight1247
  • the elements in a `Row` are sorted alphabetically. Field order is probably not the source of your trouble. Please [edit] your question to include a [reproducible example](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples), along with what error you are getting and what the desired output is. – pault Feb 08 '19 at 15:13
  • Also take a look at [this question](https://stackoverflow.com/q/54484067/5858851). – pault Feb 08 '19 at 15:14

0 Answers