
My data is as shown below:

Store  ID  Amount
1      1   10
1      2   20
2      1   10
3      4   50

I have to create a separate directory for each store:

store 1/accounts directory:
ID  Amount
1   10
2   20

store 2/accounts directory:
ID  Amount
1   10

Can I use a loop over the Spark DataFrame for this, as shown below? It works on my local machine. Will it be a problem on a cluster?

storecount = 1
while storecount <= 50:
    query = "SELECT * FROM Sales WHERE Store={}".format(storecount)
    DF = spark.sql(query)
    # each store gets its own output directory; reusing a single fixed
    # path would fail on the second iteration because it already exists
    DF.write.format("csv").save("{}/store_{}".format(path, storecount))
    storecount = storecount + 1
ranjith reddy
  • You could adapt this solution for your needs: https://stackoverflow.com/questions/30338213/writing-rdd-partitions-to-individual-parquet-files-in-its-own-directory/32835922#32835922 – Egor4eg Sep 19 '17 at 15:13

2 Answers


If I understood the problem correctly, what you really want to do is partition the dataframe.

I would suggest doing this:

df.write.partitionBy("Store").mode("append").csv("..")

This will write the dataframe into one subdirectory per partition value, like:

Store=2/
Store=1/
....
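
For completeness, here is a minimal PySpark sketch of this applied to the sample data from the question. The session setup, the inline sample rows, and the output path /tmp/accounts are assumptions for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-by-store").getOrCreate()

# hypothetical sample matching the data in the question
df = spark.createDataFrame(
    [(1, 1, 10), (1, 2, 20), (2, 1, 10), (3, 4, 50)],
    ["Store", "ID", "Amount"],
)

# a single write creates Store=1/, Store=2/ and Store=3/ subdirectories,
# each containing only that store's rows
df.write.partitionBy("Store").mode("append").csv("/tmp/accounts", header=True)

Note that the partition column is encoded in the directory name (Store=1/) rather than written inside the CSV files, so each file holds only ID and Amount, which matches the accounts layout you described.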
Avishek Bhattacharya

Yes, you can run a loop here, since the loop itself executes on the driver and is not a nested operation on the DataFrame. Nested operations on an RDD or DataFrame are not allowed, because the SparkContext is not serializable and therefore cannot be used inside executor-side code.
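
To make the distinction concrete, here is a minimal sketch; the Sales temp view comes from the question, while the /tmp/stores output paths are placeholders:

# driver-side loop: allowed, each iteration simply submits a normal job
for store in range(1, 51):
    spark.sql("SELECT * FROM Sales WHERE Store={}".format(store)) \
        .write.format("csv").save("/tmp/stores/{}".format(store))

# nested operation: NOT allowed; calling the SparkSession from inside an
# executor-side function would require shipping the context to the
# executors, which fails because it is not serializable
# df.rdd.foreach(lambda row: spark.sql("SELECT ...").collect())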

hagarwal