Combine ‘n’ data files to make a single Spark Dataframe

Question

I have ‘n’ delimited data sets (CSVs, for example), and one of them may have a few extra columns. I am trying to read all of them as DataFrames and combine them into one. How can I merge them, as with unionAll, into a single DataFrame?

P.S.: I can do this when I know what ‘n’ is, and it is a simple unionAll when the column counts are equal.

score 0 · Answer 1 · answered Nov 02 '18 at 09:00

There is another approach besides the solutions mentioned in the first two comments.

Read all the CSV files into a single RDD, producing an RDD[String].

Map it to an RDD[Row] in which every row has the same length, filling missing values with null or any other suitable value.

Create the DataFrame schema.

Create the DataFrame from the RDD[Row] using that schema.
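The steps above can be sketched in PySpark roughly as follows. This is a minimal sketch under several assumptions that are not in the question: the extra columns come last, the shared columns appear in the same order in every file, every column is read as a string, and the caller passes the full list of column names. The helper names (parse_line, is_header, pad_to_width, combine_csvs) are hypothetical.

```python
import csv
from typing import List, Optional, Sequence, Tuple


def parse_line(line: str, delimiter: str = ",") -> List[str]:
    """Split one delimited line into fields, honouring quoted fields."""
    return next(csv.reader([line], delimiter=delimiter), [])


def is_header(fields: Sequence[str], columns: Sequence[str]) -> bool:
    """True if the fields are a leading slice of `columns`.

    That matches the header line of any file (assumption: extra columns
    come last) and also blank lines, which parse to no fields.
    """
    return list(fields) == list(columns[:len(fields)])


def pad_to_width(fields: Sequence[str], width: int,
                 fill: Optional[str] = None) -> Tuple[Optional[str], ...]:
    """Pad a short row with `fill` so that it has exactly `width` values."""
    if len(fields) > width:
        raise ValueError(f"row has {len(fields)} fields, expected at most {width}")
    return tuple(fields) + (fill,) * (width - len(fields))


def combine_csvs(spark, paths: Sequence[str], columns: Sequence[str],
                 delimiter: str = ","):
    """Read every file into one RDD of lines and build a single DataFrame."""
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark.sql.types import StringType, StructField, StructType

    width = len(columns)
    # Step 1: textFile accepts a comma-separated list of paths.
    lines = spark.sparkContext.textFile(",".join(paths))
    # Step 2: parse, drop header lines, pad short rows with None (null).
    rows = (lines.map(lambda line: parse_line(line, delimiter))
                 .filter(lambda fields: not is_header(fields, columns))
                 .map(lambda fields: pad_to_width(fields, width)))
    # Step 3: an all-string, nullable schema (an assumption of this sketch).
    schema = StructType([StructField(name, StringType(), True) for name in columns])
    # Step 4: create the DataFrame from the padded rows.
    return spark.createDataFrame(rows, schema)
```

With an existing SparkSession named spark, a call might look like combine_csvs(spark, ["a.csv", "b.csv"], ["id", "name", "extra"]), where the file and column names are placeholders. Padding by position is what limits this approach: if the files order their columns differently, the values end up under the wrong names.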

This may not be a good approach if the CSVs have a large number of columns. Hope this helps.

