
I have two Parquet files. Parquet A has 137 columns and Parquet B has 110 columns. Parquet A contains the entire history of the table, so it has every field the table has ever had. Parquet B is everything I pull in today, and 17 columns were deleted from it. I want to union Parquet A with Parquet B, but they don't have the same number of columns, so the union fails every time.

I have tried mergeSchema, but that fails too. Is it possible to add the missing columns to Parquet B filled with nulls, and then do the union?
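For reference, this is roughly what I'm doing (the paths below are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder paths -- the real files are the two parquets described above.
df_a = spark.read.parquet("/path/to/parquet_a")   # 137 columns, full history
df_b = spark.read.parquet("/path/to/parquet_b")   # 110 columns, today's pull

# Also tried merging the schemas at read time, which did not help:
# spark.read.option("mergeSchema", "true").parquet("/path/to/parquet_a", "/path/to/parquet_b")

# Fails with an AnalysisException because the column counts differ:
df_all = df_a.union(df_b)
```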

oharr
  • Possible duplicate of [How to perform union on two DataFrames with different amounts of columns in spark?](https://stackoverflow.com/questions/39758045/how-to-perform-union-on-two-dataframes-with-different-amounts-of-columns-in-spar) – Sai Sep 22 '18 at 03:32

1 Answer


I would recommend loading both parquet files into Spark as DataFrames and using transformations to match the DataFrames' schemas. From what you describe, it sounds like you want Parquet A (the larger table) to be transformed so that it matches Parquet B's schema. The drop column function is a straightforward way to accomplish this [docs].

Here's a sample I wrote where parquet A has 5 cols, and parquet B has 4 cols.

Showing the schemas of the two tables (dataframes): [screenshot: schema print]

Dropping the extra column and creating a union of the two tables (dataframes): [screenshot: union]
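In code form, the same steps look roughly like this (the column names are made up for illustration, since the screenshots only show the general shape):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative data: "A" has 5 columns, "B" has 4 (one column only exists in A).
df_a = spark.createDataFrame(
    [(1, "a", "b", "c", "d")], ["id", "col1", "col2", "col3", "extra"]
)
df_b = spark.createDataFrame(
    [(2, "e", "f", "g")], ["id", "col1", "col2", "col3"]
)

# Show the schemas of the two dataframes.
df_a.printSchema()
df_b.printSchema()

# Drop the column that only exists in A, then union the two dataframes.
df_union = df_a.drop("extra").union(df_b)
df_union.show()
```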

GuavaKhan
  • Sorry, no. I want to add the columns from Parquet A to Parquet B; I don't want to get rid of any columns from Parquet A. I am not pulling those fields any more, so each new Parquet B file will not have those columns from Parquet A. I just want those columns to be null. I don't know what the columns are, so I'm looking for a way to compare the two parquet files: if Parquet B is missing columns, add them and do a union. – oharr Sep 06 '18 at 18:39
  • Could you check whether the link below is what you're looking for? The post answers how to perform a union on two DataFrames with different numbers of columns in Spark. Programmatically you just have to build all the missing columns as nulls, since you don't know what the missing columns are. https://stackoverflow.com/questions/39758045/how-to-perform-union-on-two-dataframes-with-different-amounts-of-columns-in-spar – Karthick Sep 06 '18 at 18:48
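A sketch of the approach described in the comments above (add every column that B is missing as a typed null literal, then union), assuming both files load cleanly as DataFrames and using placeholder paths:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder paths -- substitute the real locations of the two files.
df_a = spark.read.parquet("/path/to/parquet_a")   # full history, 137 columns
df_b = spark.read.parquet("/path/to/parquet_b")   # today's pull, 110 columns

# Every column that exists in A but not in B gets added to B as a typed null.
missing_in_b = set(df_a.columns) - set(df_b.columns)
for field in df_a.schema.fields:
    if field.name in missing_in_b:
        df_b = df_b.withColumn(field.name, F.lit(None).cast(field.dataType))

# Reorder B's columns to match A's column order, then union.
df_all = df_a.union(df_b.select(df_a.columns))
```

In newer Spark versions (3.1+), `df_a.unionByName(df_b, allowMissingColumns=True)` accomplishes the same thing in a single call.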