
I have a df which looks like this:

CustomerID  CustomerName  StoreName
101         Mike          ABC
102         Sarah         ABC
103         Alice         ABC
104         Michael       PQR
105         Abhi          PQR
106         Bill          XYZ
107         Roody         XYZ

Now I want to separate out the 3 stores into 3 separate dfs. For this I created a list of store names:

store_list = df.select("StoreName").distinct().rdd.flatMap(lambda x:x).collect()

Now I want to iterate through this list and filter each store out into its own df:

for i in store_list:
    df_{i} = df.where(col("storeName") == i)

The code has syntax errors, obviously, but that's the approach I am thinking of. I want to avoid Pandas as the datasets are huge.

Can anyone help me with this?

Thanks

  • You are almost there. You can create a dictionary and refer to the keys later for accessing the dataframes: `d = {f"df_{i}": df.where(col("storeName") == i) for i in store_list}`, and then `d['df_ABC']` would give you a sub-df with storeName='ABC' (see the sketch after these comments). Related: https://stackoverflow.com/questions/1373164/how-do-i-create-variable-variables – anky Aug 20 '21 at 10:11
  • Just a quick question: if I do the dictionary approach, will it mean my df operations will happen in driver memory and I will lose my Spark df parallelization? – Bitanshu Das Aug 20 '21 at 10:22
  • I haven't tried this scenario; however, the dataframe stored under each key of the dictionary is just a lazily evaluated dataframe, and any action will only run when you call that key of the dictionary. I don't think parallelization will be lost, but you can try it on a small/medium use case and see. – anky Aug 20 '21 at 10:35
  • @BitanshuDas, I am trying to solve the same problem. Which approach worked for you? – Ahmad Sayeed Jul 06 '22 at 07:56
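
A minimal runnable sketch of the dictionary approach from the comments; the SparkSession setup and the createDataFrame sample data are assumptions added here so the example is self-contained, not part of the original question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data matching the df in the question
df = spark.createDataFrame(
    [(101, "Mike", "ABC"), (102, "Sarah", "ABC"), (103, "Alice", "ABC"),
     (104, "Michael", "PQR"), (105, "Abhi", "PQR"),
     (106, "Bill", "XYZ"), (107, "Roody", "XYZ")],
    ["CustomerID", "CustomerName", "StoreName"],
)

# Only the small list of distinct store names is collected to the driver
store_list = [row["StoreName"] for row in df.select("StoreName").distinct().collect()]

# One lazily evaluated DataFrame per store, keyed by name; nothing is
# computed here, each value is just a query plan until an action runs
store_dfs = {f"df_{s}": df.where(col("StoreName") == s) for s in store_list}

store_dfs["df_ABC"].show()  # triggers execution for the ABC subset only

Each value in the dictionary is an ordinary Spark DataFrame, so the filtering stays distributed across the cluster; only the list of distinct store names ever lands in driver memory.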
