I am trying to convert a PySpark DataFrame column with approximately 90 million rows into a NumPy array.
I need the array as an input for the scipy.optimize.minimize function.
I have tried both converting to Pandas with toPandas() and using collect(), but both methods are very time-consuming.
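For reference, this is roughly what I have been doing (the data source path and column selection are just placeholders for my actual setup):

```python
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("path/to/data")  # placeholder source, ~90 million rows

# Approach 1: go through Pandas, then take the values as a NumPy array
adolescent_np = df.select("Adolescent").toPandas()["Adolescent"].to_numpy()

# Approach 2: collect the column to the driver and build the array there
adolescent_np = np.array(
    df.select("Adolescent").rdd.flatMap(lambda row: row).collect()
)
```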
I am new to PySpark, so if there is a faster and better approach to do this, please help.
Thanks.
This is what my dataframe looks like:
+----------+
|Adolescent|
+----------+
| 0.0|
| 0.0|
| 0.0|
| 0.0|
| 0.0|
| 0.0|
| 0.0|
| 0.0|
| 0.0|
| 0.0|
+----------+