How to union multiple CSV files into a single CSV file

Question

I am writing the code below to union multiple CSV files and write the combined data to a new file, but I am getting an error.

val filesData=List("file1", "file2")
val dataframes = filesData.map(spark.read.option("header", true).csv(_))

val combined = dataframes.reduce(_ union _)
val data = combined.rdd

val head :Array[String]= data.first()

val memberDataRDD = data.filter(_(0) != head(0))

type mismatch; found : org.apache.spark.sql.Row required: Array[String]

`head` will not be an `Array[String]`. `combined.rdd` returns an RDD of type `RDD[Row]`, as stated in the error message. — philantrovert, Jan 12 '18 at 07:24
https://stackoverflow.com/questions/37639956/how-to-import-multiple-csv-files-in-a-single-load — bob, Jan 12 '18 at 07:25
Is the error message not clear? A DataFrame is an RDD of rows, i.e. `RDD[Row]`, so you have to get the string values from the `Row` object if that is what you want. — UninformedUser, Jan 12 '18 at 10:51
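
The point made in the comments can be sketched as follows. This is a minimal example, not the asker's actual data: the in-memory `combined` DataFrame and its `id`/`name` columns are hypothetical stand-ins for the union of the CSV files, and a local `SparkSession` is assumed. Note also that with `option("header", true)` Spark uses the header line for column names and does not return it as a data row, so the header filter in the question should not be needed.

```scala
import org.apache.spark.sql.{Row, SparkSession}

// Assumption: a local session; in spark-shell the existing one is reused.
val spark = SparkSession.builder().master("local[*]").appName("row-example").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the union of the CSV DataFrames.
val combined = Seq(("1", "alice"), ("2", "bob")).toDF("id", "name")

val data = combined.rdd          // RDD[Row], not RDD[Array[String]]
val head: Row = data.first()     // the first element is a Row

// Extract string values from a Row by position.
val firstId: String = head.getString(0)
val names: Array[String] = data.map(_.getString(1)).collect()
```

Declaring `head` as `Row` (or leaving the type to inference) removes the type mismatch; `getString(i)` is one way to read a string column by index.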

score 0 · Answer 1 · answered Oct 12 '18 at 07:19

There will not be any issue as long as both CSV DataFrames have the same schema.

// Forward slashes avoid invalid escape sequences (such as \m) in a Scala string literal.
val df = spark.read.option("header", "true").csv("C:/maheswara/learning/big data/spark/sample_data/tmp")
val df1 = spark.read.option("header", "true").csv("C:/maheswara/learning/big data/spark/sample_data/tmp1")

val dfs = List(df, df1)
val dfUnion = dfs.reduce(_ union _)
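
One caveat about "same schema": `union` matches columns by position, not by name. If the files may list the same columns in a different order, `unionByName` (available from Spark 2.3) is one option. A minimal sketch, using hypothetical in-memory DataFrames `a` and `b` in place of the CSV reads:

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell the existing one is reused.
val spark = SparkSession.builder().master("local[*]").appName("union-example").getOrCreate()
import spark.implicits._

// Hypothetical data: same columns, different column order.
val a = Seq(("1", "alice")).toDF("id", "name")
val b = Seq(("bob", "2")).toDF("name", "id")

// unionByName aligns columns by name, so "2" ends up under id.
val byName = List(a, b).reduce(_ unionByName _)
```

With plain `union`, the row from `b` would keep its positional order and `bob` would land in the `id` column.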

score -1 · Answer 2 · answered Jan 12 '18 at 07:34

You can just read multiple paths directly with Spark:

spark.read.option("header", true).csv(filesData:_*)
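
For the "write the combined data to a new file" part of the question, a hedged end-to-end sketch follows. The temporary directory, file names, and sample rows are made up for illustration, and a local `SparkSession` with the local file system is assumed. Spark writes a directory rather than a single named file; `coalesce(1)` is used here so that the directory contains one part file.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell the existing one is reused.
val spark = SparkSession.builder().master("local[*]").appName("multi-csv").getOrCreate()

// Create two small CSV files in a temp directory (hypothetical sample data).
val dir = Files.createTempDirectory("csv-example")
val file1 = dir.resolve("file1.csv")
val file2 = dir.resolve("file2.csv")
Files.write(file1, "id,name\n1,alice\n".getBytes("UTF-8"))
Files.write(file2, "id,name\n2,bob\n".getBytes("UTF-8"))

val filesData = List(file1.toString, file2.toString)

// One read call over all paths; each file's header line is consumed as column names.
val combined = spark.read.option("header", "true").csv(filesData: _*)

// coalesce(1) reduces the output to a single partition, hence a single part file.
val outDir = dir.resolve("combined").toString
combined.coalesce(1).write.option("header", "true").csv(outDir)
```

Pulling everything into one partition is only reasonable when the combined data fits comfortably on a single executor.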

answered Jan 12 '18 at 07:34

manojlds

True but that's unrelated to the actual problem OP is having. – Jasper-M Jan 12 '18 at 10:10
@Jasper-M - how? OP wants to read multiple csv files with spark. The above is the solution. That's why SO requires users to provide code and details. Also read this - https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem – manojlds Jan 13 '18 at 07:58
Exactly! He's asking about reading multiple CSV files, but the code and error he provides show that his actual problem is something else. – Jasper-M Jan 13 '18 at 09:32
@Jasper-M - no. The OP is trying to do it by creating one df per file and is stuck at doing the union. The first part of creating multiple dfs itself is not needed, so the problem does not occur. The question of how to union multiple dfs when they do exist for a valid need can be a different question. – manojlds Jan 15 '18 at 03:34
