
I have a df which looks like this:

CustomerID  CustomerName  StoreName
101         Mike          ABC
102         Sarah         ABC
103         Alice         ABC
104         Michael       PQR
105         Abhi          PQR
106         Bill          XYZ
107         Roody         XYZ

Now I want to separate out the 3 stores into 3 separate dfs. For this I created a list of store names:

store_list = df.select("StoreName").distinct().rdd.flatMap(lambda x:x).collect()

Now I want to iterate through this list and filter each store out into its own df:

for i in store_list:
    df_{i} = df.where(col("storeName") == i)

The code has syntax errors, obviously, but that's the approach I am thinking of. I want to avoid Pandas as the datasets are huge.

Can anyone help me with this?

Thanks

  • You are almost there. You can create a dictionary and refer to the keys later for accessing the dataframes: `d = {f"df_{i}": df.where(col("storeName") == i) for i in store_list}`, and then `d['df_ABC']` would give you a sub-df with storeName='ABC' (see the sketch after these comments). Related: https://stackoverflow.com/questions/1373164/how-do-i-create-variable-variables – anky Aug 20 '21 at 10:11
  • Just a quick question: if I do the dictionary approach, will it mean my df operations will happen in driver memory and I will lose my Spark df parallelization? – Bitanshu Das Aug 20 '21 at 10:22
  • I haven't tried this scenario; however, the dataframe stored under each key of the dictionary is just a lazily evaluated dataframe, and any action will only run when you call that key of the dictionary. I don't think parallelization will be lost, but you can try it on a small/medium use case and see. – anky Aug 20 '21 at 10:35
  • @BitanshuDas, I am trying to solve the same problem. Which approach worked for you? – Ahmad Sayeed Jul 06 '22 at 07:56
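
A minimal runnable sketch of the dictionary approach from the comments; the SparkSession setup and the createDataFrame sample data are assumptions added here so the example is self-contained, not part of the original question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data matching the df in the question
df = spark.createDataFrame(
    [(101, "Mike", "ABC"), (102, "Sarah", "ABC"), (103, "Alice", "ABC"),
     (104, "Michael", "PQR"), (105, "Abhi", "PQR"),
     (106, "Bill", "XYZ"), (107, "Roody", "XYZ")],
    ["CustomerID", "CustomerName", "StoreName"],
)

# Only the small list of distinct store names is collected to the driver
store_list = [row["StoreName"] for row in df.select("StoreName").distinct().collect()]

# One lazily evaluated DataFrame per store, keyed by name; nothing is
# computed here, each value is just a query plan until an action runs
store_dfs = {f"df_{s}": df.where(col("StoreName") == s) for s in store_list}

store_dfs["df_ABC"].show()  # triggers execution for the ABC subset only

Each value in the dictionary is an ordinary Spark DataFrame, so the filtering stays distributed across the cluster; only the list of distinct store names ever lands in driver memory.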
