
I have a directory of CSV files that share the same columns, but not in the same order. I would like to append them into one CSV file, but when I do that with PySpark using the following code, the resulting CSV has mixed data inside (i.e. it does not line up the columns correctly).

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local", "Simple App")
sqlContext = SQLContext(sc)

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferSchema='true').load('/myPATH/TO_THE_CSV_FILES/')
df.coalesce(1).write.option("header", "true").format('com.databricks.spark.csv').save('/myPATH/TO_APPENDED_CSV_FILE/')
deltascience

1 Answer


You can use a little trick: `union()` matches columns by position, not by name, so select the columns of both DataFrames in the same order before the union.

cols = a.columns

a = a.select(cols)
b = b.select(cols)

c = a.union(b)
David
Frank
  • Was Frank's answer, I just edited it. If you're going to do `df.coalesce(1)`, you might as well not use spark. With python/pandas, you could list all files in the directory (https://stackoverflow.com/questions/3207219/how-do-i-list-all-files-of-a-directory), use pandas to create a df for each file, and concatenate all the dfs (https://stackoverflow.com/questions/21435176/appending-two-dataframes-with-same-columns-different-order) – David Feb 21 '18 at 19:53
  • Maybe I wasn't clear in my description, but the files are stored in HDFS and cannot be accessed directly using pandas, can they? – deltascience Feb 22 '18 at 08:46
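For files on a local filesystem (not HDFS), the pandas approach David suggests could look like the following sketch. The in-memory CSVs and the output path are hypothetical; `pd.concat` aligns columns by name, so the differing column order is harmless:

```python
import io

import pandas as pd

# Two CSVs with the same columns in different order (made-up data,
# standing in for files listed from a directory).
csv1 = io.StringIO("id,name\n1,x\n")
csv2 = io.StringIO("name,id\ny,2\n")

# concat aligns columns by name, so the order difference is resolved.
combined = pd.concat([pd.read_csv(csv1), pd.read_csv(csv2)], ignore_index=True)

# combined.to_csv("/myPATH/combined.csv", index=False)  # hypothetical output path
```

For HDFS-backed files, pandas alone cannot read the data directly; the Spark trick above stays the simpler route in that case.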