I'm trying to run a function that takes a dense vector and splits it into individual columns.
df contains 'ID' and 'feature' as columns. The code below transforms it into the form ID, _2, _3, _4, ..., where _2, _3, ... are the columns created by splitting the 'feature' vector column:
def extract(row):
    return (row.ID, ) + tuple(float(x) for x in row.feature.values)

df = df.rdd.map(extract).toDF(["ID"])
This code fails when I execute it on the entire df, which has nearly a million IDs. However, if I take a sample of 100 rows and run the same code, it works perfectly. As far as I understand, this is a memory issue. What would be an efficient way to do this on the larger dataset? Any help would be appreciated. I'm using Spark 2.0+.
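For reference, here is a self-contained toy version of the small-sample case that does work for me; the IDs and vector values are made up purely for illustration:

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# Made-up sample standing in for the real df: an ID plus a dense 'feature' vector
sample = spark.createDataFrame(
    [(1, Vectors.dense([0.1, 0.2, 0.3])),
     (2, Vectors.dense([0.4, 0.5, 0.6]))],
    ["ID", "feature"],
)

# Same extract as above
def extract(row):
    return (row.ID, ) + tuple(float(x) for x in row.feature.values)

sample.rdd.map(extract).toDF(["ID"]).show()
# Resulting columns: ID, _2, _3, _4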
Edit: The error message is in the linked screenshot ("Spark Error Snapshot").
Newest Edit: Data cleaning and preprocessing happen before df is created, so df has no nulls.
Additional Info: The linked question How to explode columns? has a Scala-based answer to essentially this problem. The thing is, can I implement the same approach in PySpark? My rough attempt is sketched below.
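I'm not sure I'm translating the Scala answer correctly, but the closest PySpark sketch I could come up with stays in the DataFrame API: a UDF converts the vector to an array, and each element is then selected into its own column. The vector length (n = 3 here) is just a placeholder for my actual feature dimension:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, DoubleType

# Convert the ml DenseVector into a plain Python list of doubles
to_array = udf(lambda v: v.toArray().tolist(), ArrayType(DoubleType()))

n = 3  # placeholder for the real vector length
split_df = (
    df.withColumn("arr", to_array(col("feature")))
      .select("ID", *[col("arr")[i].alias("_%d" % (i + 2)) for i in range(n)])
)

Is this the right way to port that answer, and would it behave any better memory-wise than the RDD map at this scale?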