I need to dynamically create multiple DataFrames in PySpark, one for each value in a Python list.

My DataFrame (df) contains:

date        gender balance
2018-01-01   M     100
2018-02-01   F     100
2018-03-01   M     100

my_list = ['2018-01-01', '2018-02-01', '2018-03-01']
for i in my_list:
  df_i = df.select("*").filter("date=i").limit(1000)

Could you please help?

cph_sto
Mohan
  • I want to split the existing PySpark DataFrame into 3 DataFrames based only on the date, so I created a list of the distinct values and tried filtering with the filter function by passing in each date. – Mohan Feb 18 '19 at 09:07
  • Yes, I understand now. Sorry, my bad. Mohan, `list` is the name of a built-in type and should not be used as a variable name. I edited the question accordingly. – cph_sto Feb 18 '19 at 09:08

1 Answer

I don't think you can create DataFrame names dynamically in PySpark in any clean way. A DataFrame is bound to an ordinary Python variable, and Python offers no clean way to create variable names at runtime (writing into `globals()` works, but it is fragile and generally discouraged).

One way is to store the DataFrames in a dictionary, where each key is a date and each value is the DataFrame filtered to that date.

For plain Python: refer to this link, where someone asked a similar question about creating variable names dynamically.
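
To illustrate the point in plain Python, here is a minimal sketch that needs no Spark at all. The tuples in `rows` are hypothetical stand-in data, and `rows_by_date` is a name made up for this example. It shows that `df_i` inside a loop is a single fixed name that gets overwritten on every pass, while a dictionary keeps one entry per date.

```python
# Hypothetical stand-in data: plain tuples instead of Spark rows.
rows = [
    ('2018-01-01', 'M', 100),
    ('2018-02-01', 'F', 100),
    ('2018-03-01', 'M', 100),
]
my_list = ['2018-01-01', '2018-02-01', '2018-03-01']

# "df_i" is one fixed name, not "df_" plus the value of i.
# Every pass rebinds it, so only the result for the last date survives.
for i in my_list:
    df_i = [row for row in rows if row[0] == i]

# A dictionary keeps one entry per date instead.
rows_by_date = {i: [row for row in rows if row[0] == i] for i in my_list}

print(df_i)
print(rows_by_date['2018-01-01'])
```

The same pattern carries over to PySpark by storing a filtered DataFrame as each dictionary value.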

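Here is a small PySpark implementation:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# sqlContext is only predefined in older shells; a SparkSession works in scripts too.
spark = SparkSession.builder.getOrCreate()
values = [('2018-01-01','M',100),('2018-02-01','F',100),('2018-03-01','M',100)]
df = spark.createDataFrame(values,['date','gender','balance'])
df.show()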
+----------+------+-------+
|      date|gender|balance|
+----------+------+-------+
|2018-01-01|     M|    100|
|2018-02-01|     F|    100|
|2018-03-01|     M|    100|
+----------+------+-------+

# Dictionary to store the DataFrames.
# Key: a date from my_list.
# Value: the DataFrame filtered to that date.
dictionary_df = {}

my_list = ['2018-01-01', '2018-02-01', '2018-03-01']
for i in my_list:
    dictionary_df[i] = df.filter(col('date')==i)

for i in my_list:
    print('DF: '+i)
    dictionary_df[i].show() 

DF: 2018-01-01
+----------+------+-------+
|      date|gender|balance|
+----------+------+-------+
|2018-01-01|     M|    100|
+----------+------+-------+

DF: 2018-02-01
+----------+------+-------+
|      date|gender|balance|
+----------+------+-------+
|2018-02-01|     F|    100|
+----------+------+-------+

DF: 2018-03-01
+----------+------+-------+
|      date|gender|balance|
+----------+------+-------+
|2018-03-01|     M|    100|
+----------+------+-------+

print(dictionary_df)
    {'2018-01-01': DataFrame[date: string, gender: string, balance: bigint], '2018-02-01': DataFrame[date: string, gender: string, balance: bigint], '2018-03-01': DataFrame[date: string, gender: string, balance: bigint]}
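
A side note on the `filter("date=i")` call in the question: inside a quoted string, `i` is never replaced by the loop variable, so Spark receives the literal text `date=i` and tries to resolve `i` as a column name. The code above sidesteps this with `col('date')==i`. If a SQL-style string is preferred, building it could look roughly like the sketch below. The helper name `date_filter_expr` is hypothetical, and the sketch assumes the values are ISO-formatted date strings.

```python
from datetime import date

def date_filter_expr(value, column="date"):
    """Return a SQL filter string such as "date = '2018-01-01'"."""
    # Validate first: fromisoformat raises ValueError on anything that is
    # not a valid ISO date, so no stray quotes end up in the expression.
    date.fromisoformat(value)
    return f"{column} = '{value}'"

my_list = ['2018-01-01', '2018-02-01', '2018-03-01']

# One filter expression per date, keyed the same way as dictionary_df.
filter_exprs = {d: date_filter_expr(d) for d in my_list}

for d, expr in filter_exprs.items():
    print(d, '->', expr)
```

Each string can then be passed to `df.filter(...)`, which accepts a SQL expression string as well as a Column, for example `df.filter(filter_exprs[i]).limit(1000)` inside the loop.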
cph_sto