
I am writing the code below to union multiple CSV files and write the combined data to a new file, but I am getting an error.

val filesData=List("file1", "file2")
val dataframes = filesData.map(spark.read.option("header", true).csv(_))

val combined = dataframes.reduce(_ union _)
val data = combined.rdd

val head :Array[String]= data.first()

val memberDataRDD = data.filter(_(0) != head(0))

type mismatch; found : org.apache.spark.sql.Row required: Array[String]

manojlds
lak
  • `head` will not be an `Array[String]`. `combined.rdd` returns an `RDD[Row]`, as stated in the error message, so `data.first()` is a `Row`. – philantrovert Jan 12 '18 at 07:24
  • https://stackoverflow.com/questions/37639956/how-to-import-multiple-csv-files-in-a-single-load – bob Jan 12 '18 at 07:25
  • Isn't the error message clear? A DataFrame is an RDD of rows (`RDD[Row]`), so you have to get the string values out of the `Row` object if that is what you want. – UninformedUser Jan 12 '18 at 10:51
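
What the comments describe could be sketched roughly as follows. This assumes a local `SparkSession` and two small throwaway CSV files created only for the illustration; the object name `UnionCsvSketch`, the helper `readAndUnion`, the file contents, and the column names are all hypothetical. The point is only that `head` is typed as `Row` and its fields are read with `getString` (the CSV reader loads every column as a string unless a schema is inferred or supplied).

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object UnionCsvSketch {
  // Read each CSV (with a header line) into its own DataFrame, then union them.
  def readAndUnion(spark: SparkSession, paths: Seq[String]): DataFrame =
    paths.map(p => spark.read.option("header", "true").csv(p)).reduce(_ union _)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("union-csv-sketch")
      .getOrCreate()

    // Hypothetical sample files, written to a temp directory for this sketch only.
    val dir = Files.createTempDirectory("union-csv")
    val file1 = dir.resolve("file1.csv")
    val file2 = dir.resolve("file2.csv")
    Files.write(file1, "id,name\n1,a\n2,b\n".getBytes("UTF-8"))
    Files.write(file2, "id,name\n3,c\n".getBytes("UTF-8"))

    val combined = readAndUnion(spark, Seq(file1.toString, file2.toString))
    val data = combined.rdd                  // RDD[Row], not RDD[Array[String]]

    val head: Row = data.first()             // compiles: the element type is Row
    val firstId: String = head.getString(0)  // pull the String out of the Row

    // Same filter as in the question, written against Row.
    val memberDataRDD = data.filter(row => row.getString(0) != firstId)
    println(memberDataRDD.count())

    spark.stop()
  }
}
```

One thing to watch: with `option("header", true)` the reader already consumes the header line of each file, so `data.first()` is the first data row rather than the header, and the filter above drops real data. If the goal was only to get rid of the headers, the filter is probably unnecessary.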

2 Answers


There will not be any issue as long as both CSV DataFrames have the same schema:

// Forward slashes avoid invalid escape sequences (such as \m) in a Scala string literal.
val df = spark.read.option("header", "true").csv("C:/maheswara/learning/big data/spark/sample_data/tmp")
val df1 = spark.read.option("header", "true").csv("C:/maheswara/learning/big data/spark/sample_data/tmp1")

val dfs = List(df, df1)
val dfUnion = dfs.reduce(_ union _)
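
To illustrate why the schema matters: `union` lines columns up by position, not by name. The sketch below is not from the original answer. It assumes Spark 2.3 or later (for `unionByName`) and uses made-up in-memory data and a hypothetical object name, `UnionByNameSketch`, in place of the CSV files.

```scala
import org.apache.spark.sql.SparkSession

object UnionByNameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("union-by-name-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data: same column names, different column order.
    val a = Seq(("1", "x")).toDF("id", "name")
    val b = Seq(("y", "2")).toDF("name", "id")

    // Positional: b's first column lands under "id", so its row becomes id = "y", name = "2".
    val byPosition = a.union(b)

    // By name: b's columns are matched to a's, so its row becomes id = "2", name = "y".
    val byName = a.unionByName(b)

    byPosition.show()
    byName.show()

    spark.stop()
  }
}
```

If the files share column names but may differ in column order, `unionByName` is probably the safer choice inside the `reduce`.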

mputha

You can just read multiple paths directly with Spark:

spark.read.option("header", true).csv(filesData:_*)
manojlds
  • True but that's unrelated to the actual problem OP is having. – Jasper-M Jan 12 '18 at 10:10
  • @Jasper-M - how? OP wants to read multiple csv files with spark. The above is the solution. That's why SO requires users to provide code and details. Also read this - https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem – manojlds Jan 13 '18 at 07:58
  • Exactly! He's asking about reading multiple CSV files, but the code and error he provides show that his actual problem is something else. – Jasper-M Jan 13 '18 at 09:32
  • @Jasper-M - no. The OP is trying to do it by creating one df per file and is stuck at doing the union. The first part of creating multiple dfs itself is not needed, so the problem does not occur. The question of how to union multiple dfs when they do exist for a valid need can be a different question. – manojlds Jan 15 '18 at 03:34