
I'm trying to run a function that takes a dense vector and splits it into individual columns.

df contains 'ID' and 'feature' as columns. The code below transforms it into the form ID, _2, _3, _4, ... where _2, _3, ... are the columns created by splitting the 'feature' vector column.

def extract(row):
    # Flatten each row into (ID, x1, x2, ...) from the vector's values
    return (row.ID,) + tuple(float(x) for x in row.feature.values)

# Columns beyond the supplied "ID" name come out as _2, _3, ...
df = df.rdd.map(extract).toDF(["ID"])
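
For reference, here is a sketched variant of the same function that calls toArray() instead of .values. It only matters if 'feature' could ever hold SparseVectors, whose .values attribute contains just the non-zero entries; that is an assumption on my part, not something established above, and the name extract_full is just for illustration.

def extract_full(row):
    # toArray() returns the full-length array for both DenseVector and
    # SparseVector, whereas SparseVector.values holds only the non-zero
    # entries and would yield rows of differing lengths
    return (row.ID,) + tuple(float(x) for x in row.feature.toArray())

df = df.rdd.map(extract_full).toDF(["ID"])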

This code fails when I execute it on the entire df that has nearly a million IDs.

But if I take a sample of 100 rows and run the same code, it works perfectly. As far as I understand, this is a memory issue. What would be an efficient way to do this on the larger dataset? Any help would be appreciated. I'm using Spark 2.0+.

Edit: Error Message: Spark Error Snapshot

Newest Edit: Data cleaning and preprocessing happen before df is created, so df has no nulls.

Additional Info: This link, How to explode columns?, has a Scala-based answer to my question. Can I implement the same thing in PySpark?
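
For reference, a rough PySpark sketch of that split-into-columns idea, not taken from the linked answer: convert the vector into a plain array column with a UDF, then select one array element per column. It assumes every 'feature' vector has the same length (taken here from the first row), and the names feature_arr and f_0, f_1, ... are placeholders of my own.

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, DoubleType

# Turn the ML vector into a plain array column
to_array = F.udf(lambda v: v.toArray().tolist() if v is not None else None,
                 ArrayType(DoubleType()))

# Number of elements, assuming every vector has the same length
n = len(df.first()["feature"].toArray())

df_split = (df
    .withColumn("feature_arr", to_array("feature"))
    .select(["ID"] + [F.col("feature_arr")[i].alias("f_{}".format(i))
                      for i in range(n)]))

In Spark 2.x there is no built-in vector-to-array function, so a UDF along these lines is the usual workaround.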

  • According to the error information, I think @Chobeat's judgement could be right. Maybe you should check your data first and do some data cleaning. Any update? – Peter Pan Feb 09 '17 at 14:44
  • Is there a better way of achieving what I want? Basically, I need to split a vector with n values into n columns. – Riju Bhattacharyya Feb 15 '17 at 06:19

1 Answer


The relevant piece of the error is key not found: 3.0.

I'm 99.99% sure it works on the sample because all of its inputs happen to be valid, but in the whole dataset you may have some rows that break the job. There could be different causes, but checking the schema and the content of the rows should help you investigate the issue.
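
As a concrete starting point for that check, a small sketch using the column names from the question:

# Confirm 'feature' really is a vector column of the type you expect
df.printSchema()

# Look for missing values; a handful of nulls in the full dataset would
# explain why a clean 100-row sample succeeds while the full run fails
df.filter(df["feature"].isNull()).count()

# Eyeball a few raw rows to check that every vector has the expected length
df.select("ID", "feature").show(5, truncate=False)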

Chobeat