
I need to build the final dataframe name dynamically from the config (joining `final_df` and `suffix`). When I run the code at the end, I get the error `SyntaxError: can't assign to operator`. However, if I replace `each["final_df"]+'_'+ each["suffix"]` with any other name, it works.

Data :

df_source_1 = spark.createDataFrame(
        [
          (123,10),
          (123,15),
          (123,20)
        ],
        ("cust_id", "value")
    )

Config:

config = """
                [ 
                  {
                      "source_df":"df_source_1",
                      "suffix": "new", 
                      "group":["cust_id"],
                      "final_df": "df_taregt_1"
                  }
                ]
                """   

Code:

import json   
for each in json.loads(config):
    print("Before=",each['final_df'] ) # str object
    print(each["final_df"]+'_'+ each["suffix"]) # df_taregt_1_new , print statement works
    each["final_df"]+'_'+ each["suffix"] = eval(each["source_df"]).groupBy(each["group"]).agg(sum("value")) # Errors out. Here I need to assign the dataframe to df_taregt_1_new
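A minimal reproduction of the error, independent of Spark. Python only allows a name, attribute, or subscript as the target of `=`; an operator expression like `a + b` on the left-hand side fails at compile time:

```python
# Compiling an assignment whose target is a "+" expression raises
# SyntaxError, which is the same failure as the line above.
try:
    compile('d["final_df"] + "_" + d["suffix"] = 1', "<repro>", "exec")
except SyntaxError as e:
    print("SyntaxError:", e.msg)
```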

Could anyone help?

Matthew
  • That's a terrible implementation you're trying to do. You should probably explain why you want to do that ... using a dict with key as the old or new name would be way better. [Why is using 'eval' a bad practice?](https://stackoverflow.com/questions/1832940/why-is-using-eval-a-bad-practice) – Steven Aug 31 '21 at 12:21
  • 1
    FYI, `each["final_df"]+'_'+ each["suffix"]` is a string, it cannot be assigned. That's why you got the error. – Steven Aug 31 '21 at 12:24
  • I posted an oversimplified use case. In the real case, for the same data source/group, I have different operations, i.e. min and max. So I wanted to create two dataframes, with names created dynamically based on the operation, so that at the end I can combine both dataframes into the one mentioned in "final_df". The config would look like: `{ "source_df":"df_source_1", "operation": { "min": {}, "max" : {}}, "group":["cust_id"], "final_df": "df_taregt_1" }` – Matthew Aug 31 '21 at 12:31
  • @Steven, also in cases where we need to source the dataframe names from config, what alternatives do we have other than eval? – Matthew Aug 31 '21 at 12:39
  • Use a dict ... That's much simpler. No need to create dynamic variables, no need to use eval. – Steven Aug 31 '21 at 12:46
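Sketching the min/max use case from the comments with the dict approach Steven suggests. This is a plain-Python sketch: the list `[10, 15, 20]` is a placeholder standing in for a Spark DataFrame, and the `min`/`max` calls stand in for `groupBy(...).agg(...)`; the names come from the question's config.

```python
import json

# Extended config from the comments: several operations per source.
config = """
[
  {
    "source_df": "df_source_1",
    "operation": {"min": {}, "max": {}},
    "group": ["cust_id"],
    "final_df": "df_taregt_1"
  }
]
"""

df_dict = {"df_source_1": [10, 15, 20]}  # placeholder for a DataFrame

for each in json.loads(config):
    for op in each["operation"]:
        # Keys like df_taregt_1_min / df_taregt_1_max -- no eval() needed.
        key = each["final_df"] + "_" + op
        # Stand-in for groupBy(each["group"]).agg(F.min/F.max(...)).
        df_dict[key] = {"min": min, "max": max}[op](df_dict[each["source_df"]])

print(sorted(k for k in df_dict if k.startswith("df_taregt_1")))
# ['df_taregt_1_max', 'df_taregt_1_min']
```

At the end, the two per-operation entries can be looked up by their dynamic keys and combined into the final dataframe.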

1 Answer


Your code with a dict:

import json
from pyspark.sql import functions as F  # use F.sum, not the builtin sum

df_dict = {}
df_dict["df_source_1"] = spark.createDataFrame(
    [(123, 10), (123, 15), (123, 20)], ("cust_id", "value")
)

for each in json.loads(config):
    df_dict[each["final_df"] + "_" + each["suffix"]] = (
        df_dict[each["source_df"]].groupBy(each["group"]).agg(F.sum("value"))
    )

Instead of working with objects that are supposedly created dynamically, you have a dict that stores all these objects under their dynamic names. You can even test the dict to know whether an object exists or not.
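For example (the key and placeholder value below are assumed, following the names in the question's config):

```python
# The dict doubles as a registry: a membership test replaces eval().
df_dict = {"df_taregt_1_new": "placeholder DataFrame"}

name = "df_taregt_1" + "_" + "new"   # built dynamically from the config
if name in df_dict:
    result = df_dict[name]           # safe lookup by dynamic name
print(name in df_dict)  # True
```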

Steven