
I have a directory of CSV files that share the same columns, but not in the same order. I would like to append them into one CSV file, but when I do that with PySpark using the following code, the resulting CSV has mixed data inside (i.e. it does not line up the columns correctly).

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local", "Simple App")
sqlContext = SQLContext(sc)

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferSchema='true').load('/myPATH/TO_THE_CSV_FILES/')
df.coalesce(1).write.option("header", "true").format('com.databricks.spark.csv').save('/myPATH/TO_APPENDED_CSV_FILE/')
deltascience

1 Answer


You can use a little trick: `union()` matches columns by position, not by name, so select the columns of both DataFrames in the same order before the union.

cols = a.columns

a = a.select(cols)
b = b.select(cols)

c = a.union(b)
David
Frank
  • Was Frank's answer, I just edited it. If you're going to do `df.coalesce(1)`, you might as well not use spark. With python/pandas, you could list all files in the directory (https://stackoverflow.com/questions/3207219/how-do-i-list-all-files-of-a-directory), use pandas to create a df for each file, and concatenate all the dfs (https://stackoverflow.com/questions/21435176/appending-two-dataframes-with-same-columns-different-order) – David Feb 21 '18 at 19:53
  • Maybe I wasn't clear in my description, but the files are stored in HDFS and cannot be accessed directly using pandas, can they? – deltascience Feb 22 '18 at 08:46
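For files on a local filesystem (not HDFS), the pandas approach David suggests could look like the following sketch. The in-memory CSVs and the output path are hypothetical; `pd.concat` aligns columns by name, so the differing column order is harmless:

```python
import io

import pandas as pd

# Two CSVs with the same columns in different order (made-up data,
# standing in for files listed from a directory).
csv1 = io.StringIO("id,name\n1,x\n")
csv2 = io.StringIO("name,id\ny,2\n")

# concat aligns columns by name, so the order difference is resolved.
combined = pd.concat([pd.read_csv(csv1), pd.read_csv(csv2)], ignore_index=True)

# combined.to_csv("/myPATH/combined.csv", index=False)  # hypothetical output path
```

For HDFS-backed files, pandas alone cannot read the data directly; the Spark trick above stays the simpler route in that case.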