
I have appended two DataFrames to the same S3 location, and when reading the result back, one of the columns is missing from the output df. The same piece of code works completely fine in Spark 2.0, but in Spark 2.4 the column is missing.

Nothing has changed in the code; I'm just running the same code on two different versions of Spark.
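A minimal sketch of the scenario as reconstructed from the comments below (the path and column names are hypothetical). By default Spark does not merge Parquet schemas across files, so when appended files have different schemas, the reader can pick a schema that lacks a column; the documented `mergeSchema` read option asks for the union of schemas instead:

```scala
import org.apache.spark.sql.SparkSession

object AppendSchemaDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("append-schema-demo").getOrCreate()
    import spark.implicits._

    val path = "s3a://my-bucket/path/to/s3/1" // hypothetical path

    // Two DataFrames whose schemas differ by one column.
    val df1 = Seq((1, "a")).toDF("id", "col_a")
    val df2 = Seq((2, "b", "c")).toDF("id", "col_a", "col_b")

    df1.write.mode("overwrite").parquet(path)
    df2.write.mode("append").parquet(path)

    // Without schema merging, the reader takes the schema from one
    // file's footer, so a column present in only some files can
    // appear to be missing from the result.
    spark.read.parquet(path).printSchema()

    // Opting into schema merging reads the union of the file schemas.
    spark.read.option("mergeSchema", "true").parquet(path).printSchema()

    spark.stop()
  }
}
```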

Abhishek Mishra
  • Maybe you can paste your sample code and error message. – howie Jun 11 '19 at 00:35
  • Hi there, it would be very helpful to post some input/output data as discussed here https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples – abiratsis Jun 11 '19 at 13:31
  • @howie df1.write.mode("overwrite").parquet("path/to/s3/1"); df2.write.mode("append").parquet("path/to/s3/1") // both df1 and df2 may have different schemas. val df = spark.read.parquet("path/to/s3/1"); df.printSchema // one of df1's columns is missing in the final df. Note: I'm also applying a schema to df1 depending on a condition, e.g. if the df is empty, I just pass an empty df. – Abhishek Mishra Jun 11 '19 at 15:13
  • @AlexandrosBiratsis what will be the output schema if one column is missing in df1 but present in df2? Will it be there in the final df or not? – Abhishek Mishra Jun 11 '19 at 15:23
  • it depends on the operation you are executing. If the operation is a join, then yes, a column present in one of them will appear in the final dataset :) So try to post a minimal example similar to your code and provide some data; then it will be easier to get a proper answer – abiratsis Jun 11 '19 at 15:24
  • Thanks for your quick response @AlexandrosBiratsis. How is that different from appending two dataframes to the same s3 location? – Abhishek Mishra Jun 11 '19 at 15:49
  • you can't use append with different schemas because Spark doesn't know which schema to keep! So this is an invalid operation. Also, it's not a good idea to write different dataframes into the same path :) – abiratsis Jun 11 '19 at 16:10
  • @AlexandrosBiratsis Will spark-2.x drop a column that is empty from the df? I mean, if I'm appending two dfs with the same schema but one has a particular column empty, will Spark drop that empty column or not? – Abhishek Mishra Jun 11 '19 at 18:26
  • Not sure, no. Spark doesn't do that kind of magic – abiratsis Jun 11 '19 at 18:57
  • I had faced something similar in spark 2.0.4, specifically when writing a df to HDFS: if a column had no values and was not part of the schema read from your input, it was dropped while writing. Hope that helps – Aaron Jun 11 '19 at 21:39
  • Thanks @Aaron, I'm not sure, but this may be the case. Anyway, I tried applying a default schema whenever the df is empty (a sketch of this approach follows these comments). So far everything looks fine. – Abhishek Mishra Jun 12 '19 at 14:41
  • @AlexandrosBiratsis Once it's done, I will let you know the result. – Abhishek Mishra Jun 12 '19 at 14:42
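A minimal sketch of the workaround mentioned in the last comments (the schema and names are hypothetical): when a DataFrame is empty, substitute an empty DataFrame that still carries the full expected schema, so every Parquet file written to the path agrees on its columns:

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types._

object DefaultSchemaDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("default-schema-demo").getOrCreate()

    // Hypothetical fixed schema that every write should conform to.
    val defaultSchema = StructType(Seq(
      StructField("id", IntegerType, nullable = true),
      StructField("col_a", StringType, nullable = true),
      StructField("col_b", StringType, nullable = true)
    ))

    // If the incoming DataFrame is empty, replace it with an empty
    // DataFrame that still carries the full schema.
    // Dataset.isEmpty is available from Spark 2.4.
    def withDefaultSchema(df: DataFrame): DataFrame =
      if (df.isEmpty)
        spark.createDataFrame(spark.sparkContext.emptyRDD[Row], defaultSchema)
      else df

    // Usage (hypothetical df and path):
    // withDefaultSchema(df2).write.mode("append").parquet("path/to/s3/1")

    spark.stop()
  }
}
```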

0 Answers