Consider the following scenario:
Incremental data gets ingested daily into a HDFS location, and from there I have to read the data using pyspark and find out the latest/active records.
Also, I have to handle schema changes in the data, as new fields may get added.
How can I achieve schema comparison and handle schema changes in pyspark?
How can I handle data which got loaded before the schema changes?
Is the below approach is a good one?
- Generate a script to create hive tables on top of HDFS location.
- Then compare the schema of source table and Hive table using pyspark. If there is schema change use the new schema from source to create the new ddl for table creation. Drop the existing table and create the table with new schema.
- Create a view from the hive tables to get the latest records using primary key and audit column.