I'm using Scala with Apache Spark and have a DataFrame:
Source | Column1 | Column2
------ | ------- | -------
A      | ...     | ...
B      | ...     | ...
B      | ...     | ...
C      | ...     | ...
B      | ...     | ...
C      | ...     | ...
A      | ...     | ...
I was looking into partitionBy (https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/sql/DataFrameWriter.html), but I have a specific requirement: each partition must be saved to its own, separately specified directory. Ideally, it would look something like this:
df.write.partitionBy("Source").saveAsTable(s"${CURRENT_SOURCE_VALUE}")
Is it possible to accomplish this using partitionBy, or should I try something else, such as looping over each row with an RDD, or possibly groupBy? Any pointers would be helpful; I'm fairly new to Apache Spark. I'm after something like this answer (https://stackoverflow.com/a/43998102), but I don't think that approach is possible in Apache Spark with Scala.
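For reference, my understanding of partitionBy (and part of why it doesn't seem to fit) is that it writes every partition under a single base path; the path below is just a placeholder:

// If I understand correctly, partitionBy puts everything under ONE base path,
// producing /single/base/path/Source=A/, /single/base/path/Source=B/, etc.
df.write.partitionBy("Source").parquet("/single/base/path")

What I need instead is a completely different root directory per source value.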
EDIT
The location (path) for each source will come from a separate map like so:
val sourceLocation: Map[String, String] = Map(
  "A" -> "/MyCustomPathForA/.../",
  "B" -> "/MyCustomPathForB/.../"
  // etc.: the key is the source name (A, B, C) and the value is the path;
  // each base path (root) could be different.
)
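To make the goal concrete, here's a minimal sketch of what I'm imagining, assuming a plain loop over the map is acceptable; the output format and save mode here are just placeholders:

import org.apache.spark.sql.functions.col

// Sketch: filter out each source's rows and write them to that source's mapped path.
// Assumes sourceLocation is populated as above.
sourceLocation.foreach { case (src, path) =>
  df.filter(col("Source") === src)
    .write
    .mode("overwrite")
    .parquet(path)
}

I realize this would scan the DataFrame once per source, which is part of why I'm asking whether there's a better way.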