How to handle schema changes in Spark?

Question

Consider the following scenario:
Incremental data gets ingested daily into a HDFS location, and from there I have to read the data using pyspark and find out the latest/active records. Also, I have to handle schema changes in the data, as new fields may get added.

How can I achieve schema comparison and handle schema changes in pyspark?
How can I handle data which got loaded before the schema changes?

Is the below approach is a good one?

Generate a script to create hive tables on top of HDFS location.
Then compare the schema of source table and Hive table using pyspark. If there is schema change use the new schema from source to create the new ddl for table creation. Drop the existing table and create the table with new schema.
Create a view from the hive tables to get the latest records using primary key and audit column.

Big topic and for people not conversant very simple, but in practice a few things to watch out for. See,for example: https://stackoverflow.com/questions/37644664/schema-evolution-in-parquet-format — thebluephantom, Jul 15 '19 at 20:27
So if I understood correctly you don't need to store historical data but just to merge the old data with the newest one and sore the updated schema? — abiratsis, Jul 22 '19 at 11:01
for instance you don't have any requirement to store historical data? — abiratsis, Jul 22 '19 at 11:02
Hi @AlexandrosBiratsis , In my case we are storing the historical data. And as of schema changes there will be only column additions at the end in the source DB. If there is schema change then i have to get the latest schema and add to the hive tables. — Amrutha K, Aug 02 '19 at 04:47

How to handle schema changes in Spark?

0 Answers0