
I have a small df that consists of two columns with a description and a value:

+--------------+--------------------+
|   description|               value|
+--------------+--------------------+
|   PED_tobacco|                 0.4|
|PED_nontobacco|                1.49|
|           GMI|    17590.8855333196|
|       CMO_NGP|             53389.0|
|             A|                80.3|
|         SC_TT|              -0.146|
|        SC_THP|              -0.056|
|       SC_ENDS|              -0.007|
|      SC_CF_PD|              -0.002|
|      SC_CF_FF|              -0.031|
|      CO2_comb|             1.23E-6|
|   CO2_lighter|2.083000000000000...|
|   Carbon_Cost|               114.0|
|     PR_SDG12A|               -0.05|
|     PR_SDG12B|               -0.01|
|       PR_SDG3|                 0.0|
|      PR_SDG14|               -0.27|
|EDEVICE_SDG12A|               -0.01|
|EDEVICE_SDG12B|               -0.05|
|  EDEVICE_SDG3|               -0.01|
+--------------+--------------------+

I have been trying to find a way to convert each row into its own named variable, so that I can reference it directly. For example, I want to be able to write PED_tobacco * 10 and get back 4.0.

I tried converting it into a list of dictionaries (that's the best way I can describe it, coming from a Python background), using:

ass_dict = df_assumptions \
    .rdd \
    .map(lambda row: {row[0]: row[1]}) \
    .collect()

# Which prints:
[{'PED_tobacco': 0.4}, {'PED_nontobacco': 1.49}, {'GMI': 17590.8855333196}, {'CMO_NGP': 53389.0}, {'A': 80.3}, {'SC_TT': -0.146}, {'SC_THP': -0.056}, {'SC_ENDS': -0.007}, {'SC_CF_PD': -0.002}, {'SC_CF_FF': -0.031}, {'CO2_comb': 1.23e-06}, {'CO2_lighter': 2.0830000000000002e-08}, {'Carbon_Cost': 114.0}, {'PR_SDG12A': -0.05}, {'PR_SDG12B': -0.01}, {'PR_SDG3': 0.0}, {'PR_SDG14': -0.27}, {'EDEVICE_SDG12A': -0.01}, {'EDEVICE_SDG12B': -0.05}, {'EDEVICE_SDG3': -0.01}, {'EDEVICE_SDG14': 0.0}, {'TL_GL': 1.0}, {'TL_GR': 0.0}, {'EW_GL': 0.83}]

But I still can't access each value as its own variable. In Python (with pandas) I do this using:

def convert_to_var(df):
    desc = []
    val = []  
    
    for i,row in df.iterrows():
        desc.append(i)
        val.append(row) 
        
    return dict(val)

val_dict = convert_to_var(IA)
globals().update(val_dict)

Is there a way to do the same in Spark? How can I get each description, with its value, as a separate variable that I can reference directly? Thanks in advance.

mck
sophocles

1 Answer


You can merge the collected list of dictionaries into a single dictionary and pass it to globals().update(). Creating variables dynamically ("variable variables") is a really bad idea, though. It's better to use the merged dictionary directly.

dict_list = df_assumptions \
    .rdd \
    .map(lambda row: {row[0]: row[1]}) \
    .collect()

val_dict = {k: v for d in dict_list for (k, v) in d.items()}
globals().update(val_dict)

# Or you can do
for d in dict_list:
    globals().update(d)
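
To illustrate the dictionary-first approach without touching globals(), here is a minimal sketch. It assumes `dict_list` looks like the collected result above; the three hard-coded rows are a stand-in for the Spark output, and the `assumptions` namespace is a hypothetical name introduced only for this example.

```python
from types import SimpleNamespace

# Hard-coded stand-in for the list returned by .collect() (first three rows only).
dict_list = [
    {"PED_tobacco": 0.4},
    {"PED_nontobacco": 1.49},
    {"GMI": 17590.8855333196},
]

# Merge the single-entry dictionaries into one lookup table.
val_dict = {k: v for d in dict_list for k, v in d.items()}

# Look values up by key instead of creating global variables.
print(val_dict["PED_tobacco"] * 10)

# Optional: attribute-style access, still without modifying globals().
# "assumptions" is a hypothetical name chosen for this sketch.
assumptions = SimpleNamespace(**val_dict)
print(assumptions.PED_tobacco * 10)
```

Either form keeps all the assumption values in one object, so they cannot silently overwrite other names in the module.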
mck
  • Thanks for your answer. It works. A couple of more points you can maybe address. Why is it a bad idea to have variables defined? Also, if I follow the better approach, which is to use the dictionary directly, how can I say ```PED_tobacco * 10``` directly from the dictionary? – sophocles Feb 18 '21 at 17:31
  • @sophocles to use the dict directly, you can do `val_dict['PED_tobacco'] * 10`. For the reason why it's a bad idea, you can see the comments on [this post](https://stackoverflow.com/questions/1373164/how-do-i-create-variable-variables) – mck Feb 18 '21 at 17:33
  • Thanks, I will read the link you posted, and ```val_dict['PED_tobacco'] * 10``` works well. – sophocles Feb 18 '21 at 17:37