I am trying to figure out how to dynamically create a column for each item in a list (the cp_codeset list in this case) by using the withColumn() function and calling a udf inside it in PySpark. Below is the code I wrote, but it gives me an error.
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# set membership tests are O(1); the lookup column is 'codes'
codeset = set(cp_codeset['codes'])

for col_name in cp_codeset.col_names.unique():
    # bind col_name as a default argument so each udf keeps its own value
    def flag(d, col_name=col_name):
        if d in codeset:
            # .iloc[0] reduces the matching one-row Series to a scalar
            name = cp_codeset[cp_codeset['codes'] == d].col_names.iloc[0]
            if name == col_name:
                return 1
        return 0
    cpf_udf = udf(flag, IntegerType())
    # reassign so the new columns accumulate across iterations
    p = p.withColumn(col_name, cpf_udf(p.codes))
p.show()
The other option is to do it manually, but in that case I have to write the same udf function and call it with withColumn() 75 times (the number of unique values in cp_codeset["col_names"]).
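One subtlety with defining flag inside the loop is Python's late binding: closures created in a loop all see the loop variable's final value unless it is captured explicitly, which is why binding col_name as a default argument matters. A minimal pure-Python sketch of the pitfall and the fix (no Spark involved):

```python
# Functions defined in a loop share the loop variable, so when called
# later they all see its *final* value (late binding).
late = [lambda: name for name in ["a", "b", "c"]]
print([f() for f in late])  # ['c', 'c', 'c']

# Binding the loop variable as a default argument freezes the current
# value into each function at definition time.
bound = [lambda name=name: name for name in ["a", "b", "c"]]
print([f() for f in bound])  # ['a', 'b', 'c']
```

The same `col_name=col_name` trick applies to any function, not just lambdas.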
Below are my two dataframes and the result I am trying to get:
P (this is a PySpark dataframe, and it is too large for pandas to handle)
id|codes
1|100
2|102
3|104
cp_codeset (pandas dataframe)
codes|col_names
100|a
101|b
102|c
103|d
104|e
105|f
result (pyspark dataframe)
id|codes|a|c|e
1|100 |1|0|0
2|102 |0|1|0
3|104 |0|0|1
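Since cp_codeset is small (75 column names), the per-row logic the result table encodes reduces to a plain dictionary lookup followed by a one-hot flag per column. A sketch in plain Python, using the sample data above (the names code_to_name, wanted, and one_hot are mine, for illustration):

```python
# Map each code to its column name, built once from the small lookup table.
code_to_name = {100: "a", 101: "b", 102: "c", 103: "d", 104: "e", 105: "f"}
wanted = ["a", "c", "e"]  # the col_names that actually occur in P's codes

def one_hot(code):
    """Return 0/1 flags with a 1 only in the column matching the code."""
    hit = code_to_name.get(code)  # None if the code is unknown
    return {name: int(name == hit) for name in wanted}

print(one_hot(102))  # {'a': 0, 'c': 1, 'e': 0}
```

Broadcasting a dict like this into a single udf that returns all flags at once would also cut the 75 separate udf calls down to one.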