I have a pandas DataFrame data_pandas
with about half a million rows and 30,000 columns. I want to convert it to a Spark DataFrame data_spark,
which I do with:
data_spark = sqlContext.createDataFrame(data_pandas)
I am working on an r3.8xlarge driver with 10 workers of the same configuration. The operation above runs for a very long time and then fails with an OOM error. Is there an alternative method I can try?
The source data is in HDF format, so I can't read it directly as a Spark dataframe.
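For reference, the full pipeline is roughly the following (the file path and HDF key are placeholders, and sqlContext is the same context as above):

import pandas as pd

# Read the source HDF5 file into pandas; path and key are placeholders
data_pandas = pd.read_hdf("/path/to/data.h5", key="data")

# Convert the pandas DataFrame to a Spark DataFrame; this is the step that fails with OOM
data_spark = sqlContext.createDataFrame(data_pandas)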