Below are some folders, which might keep updating with time; each contains multiple .parquet files. How can I read them into a Spark DataFrame in Scala?
- "id=200393/date=2019-03-25"
- "id=200393/date=2019-03-26"
- "id=200393/date=2019-03-27"
- "id=200393/date=2019-03-28"
- "id=200393/date=2019-03-29" and so on ...
Note: there could be 100 date folders, and I need to pick only specific ones (say 25, 26 and 28).
Is there a better way than the code below?
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.appName("ScalaCodeTest").master("yarn").getOrCreate()
val parquetFiles = List("id=200393/date=2019-03-25", "id=200393/date=2019-03-26", "id=200393/date=2019-03-28")
spark.read.format("parquet").load(parquetFiles: _*)
The above code works, but I want to build the list incrementally, something like below:
// An immutable List() cannot be assigned by index; a ListBuffer allows appending.
val parquetFiles = scala.collection.mutable.ListBuffer[String]()
parquetFiles += "id=200393/date=2019-03-25"
parquetFiles += "id=200393/date=2019-03-26"
parquetFiles += "id=200393/date=2019-03-28"
spark.read.format("parquet").load(parquetFiles: _*)
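If the goal is just to assemble the path list from the chosen dates, mapping a plain Seq of date strings to folder paths also works; a minimal sketch (the id value and dates are the ones from the example above, and the `PartitionPaths.forDates` helper name is made up for illustration):

```scala
object PartitionPaths {
  // Build partition folder paths for one id and a list of chosen dates,
  // matching the layout above: id=<id>/date=<date>.
  def forDates(id: Long, dates: Seq[String]): Seq[String] =
    dates.map(d => s"id=$id/date=$d")
}

// Usage: pass the result to the reader as varargs, e.g.
//   val paths = PartitionPaths.forDates(200393L,
//     Seq("2019-03-25", "2019-03-26", "2019-03-28"))
//   spark.read.format("parquet").load(paths: _*)
```

If the partition columns (id, date) should appear as columns in the resulting DataFrame, Spark's `basePath` read option can be set to the common parent directory so partition discovery still applies when loading individual leaf folders.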