
Consider the following scenario:
Incremental data gets ingested daily into an HDFS location, and from there I have to read the data using PySpark and find the latest/active records. I also have to handle schema changes in the data, as new fields may be added.

How can I achieve schema comparison and handle schema changes in PySpark?
How can I handle data that was loaded before the schema changes?
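
For illustration, this is roughly the kind of schema comparison I have in mind, assuming the incremental data is in Parquet; the path and table name below are just placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Schema of today's incremental load (placeholder path)
    incoming_df = spark.read.parquet("/data/ingest/current_batch")
    incoming_fields = {f.name: f.dataType for f in incoming_df.schema.fields}

    # Schema of the existing Hive table (placeholder table name)
    existing_df = spark.table("mydb.my_table")
    existing_fields = {f.name: f.dataType for f in existing_df.schema.fields}

    # Columns present in the new data but missing from the table mean the schema has evolved
    added_columns = [name for name in incoming_fields if name not in existing_fields]
    if added_columns:
        print("New columns detected:", added_columns)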

Is the approach below a good one?

  1. Generate a script to create Hive tables on top of the HDFS location.
  2. Compare the schema of the source table and the Hive table using PySpark. If there is a schema change, use the new schema from the source to create the new DDL for the table. Drop the existing table and recreate it with the new schema.
  3. Create a view on top of the Hive tables to get the latest records using the primary key and an audit column (a rough sketch of what I mean follows the list).
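
To make step 3 concrete, this is a rough sketch of picking the latest record per primary key using an audit timestamp column (the column and table names are placeholders; a permanent Hive view could be defined with an equivalent SQL query instead of a temp view):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    df = spark.table("mydb.my_table")  # hypothetical Hive table from steps 1-2

    # Rank records within each primary key by the audit timestamp, newest first
    w = Window.partitionBy("primary_key").orderBy(F.col("audit_ts").desc())

    latest_df = (df.withColumn("rn", F.row_number().over(w))
                   .filter(F.col("rn") == 1)
                   .drop("rn"))

    # Expose only the latest/active records to downstream queries
    latest_df.createOrReplaceTempView("my_table_latest")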
  • Big topic; it looks very simple to people not conversant with it, but in practice there are a few things to watch out for. See, for example: https://stackoverflow.com/questions/37644664/schema-evolution-in-parquet-format – thebluephantom Jul 15 '19 at 20:27
  • So if I understood correctly, you don't need to store historical data but just to merge the old data with the newest data and store the updated schema? – abiratsis Jul 22 '19 at 11:01
  • For instance, you don't have any requirement to store historical data? – abiratsis Jul 22 '19 at 11:02
  • Hi @Amrutha, could you check the suggested solution? – abiratsis Jul 29 '19 at 10:22
  • Hi @AlexandrosBiratsis, in my case we are storing the historical data. As for schema changes, there will only be column additions at the end in the source DB. If there is a schema change then I have to get the latest schema and add it to the Hive tables. – Amrutha K Aug 02 '19 at 04:47

0 Answers