
My data is as shown below:

Store  ID  Amount
1      1   10
1      2   20
2      1   10
3      4   50

I have to create a separate directory for each store:

store 1/accounts directory:
ID  Amount
1   10
2   20

store 2/accounts directory:
ID  Amount
1   10

Can I use a loop over the Spark DataFrame for this, as shown below? It works on my local machine. Will it be a problem on a cluster?

storecount = 1
while storecount <= 50:
    query = "SELECT * FROM Sales WHERE Store={}".format(storecount)
    DF = spark.sql(query)
    # each store gets its own output directory; reusing a single fixed
    # path would fail on the second iteration because it already exists
    DF.write.format("csv").save("{}/store_{}".format(path, storecount))
    storecount = storecount + 1
ranjith reddy
  • You could adapt this solution for your needs: https://stackoverflow.com/questions/30338213/writing-rdd-partitions-to-individual-parquet-files-in-its-own-directory/32835922#32835922 – Egor4eg Sep 19 '17 at 15:13

2 Answers


If I understood the problem correctly, what you really want to do is partition the dataframe.

I would suggest doing this:

df.write.partitionBy("Store").mode("append").csv("..")

This will write the dataframe into one subdirectory per partition value, like:

Store=2/
Store=1/
....
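
For completeness, here is a minimal PySpark sketch of this applied to the sample data from the question. The session setup, the inline sample rows, and the output path /tmp/accounts are assumptions for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-by-store").getOrCreate()

# hypothetical sample matching the data in the question
df = spark.createDataFrame(
    [(1, 1, 10), (1, 2, 20), (2, 1, 10), (3, 4, 50)],
    ["Store", "ID", "Amount"],
)

# a single write creates Store=1/, Store=2/ and Store=3/ subdirectories,
# each containing only that store's rows
df.write.partitionBy("Store").mode("append").csv("/tmp/accounts", header=True)

Note that the partition column is encoded in the directory name (Store=1/) rather than written inside the CSV files, so each file holds only ID and Amount, which matches the accounts layout you described.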
Avishek Bhattacharya

Yes, you can run a loop here, since the loop itself executes on the driver and is not a nested operation on the DataFrame. Nested operations on an RDD or DataFrame are not allowed, because the SparkContext is not serializable and therefore cannot be used inside executor-side code.
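
To make the distinction concrete, here is a minimal sketch; the Sales temp view comes from the question, while the /tmp/stores output paths are placeholders:

# driver-side loop: allowed, each iteration simply submits a normal job
for store in range(1, 51):
    spark.sql("SELECT * FROM Sales WHERE Store={}".format(store)) \
        .write.format("csv").save("/tmp/stores/{}".format(store))

# nested operation: NOT allowed; calling the SparkSession from inside an
# executor-side function would require shipping the context to the
# executors, which fails because it is not serializable
# df.rdd.foreach(lambda row: spark.sql("SELECT ...").collect())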

hagarwal