I have a Spark DataFrame with 10 million records and 150 columns, and I am attempting to convert it to a pandas DataFrame.
x = df.toPandas()
# do some things to x
It is failing with ValueError: ordinal must be >= 1. I am assuming this is because the DataFrame is just too big to collect at once. Is it possible to chunk it and convert each chunk to a pandas DataFrame?
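To show what I mean by chunking, here is a rough sketch of the kind of thing I am hoping is possible (assuming a random 10-way split is acceptable; df is the original Spark DataFrame):

import pandas as pd

# split into 10 roughly equal Spark DataFrames so only one chunk
# has to be collected on the driver at a time
chunks = df.randomSplit([1.0] * 10, seed=42)

pandas_chunks = []
for chunk in chunks:
    pdf = chunk.toPandas()   # convert one chunk at a time
    # do some things to pdf here
    pandas_chunks.append(pdf)

# only sensible if the per-chunk processing shrinks the data enough
# to fit in driver memory
x = pd.concat(pandas_chunks, ignore_index=True)

If keeping all the processed chunks in memory is still too much, I would drop the final concat and handle each pdf independently.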
Full stack trace:
ValueError Traceback (most recent call last)
<command-2054265283599157> in <module>()
158 from db.table where snapshot_year_month=201806""")
--> 159 ps = x.toPandas()
160 # ps[["pol_nbr",
161 # "pol_eff_dt",
/databricks/spark/python/pyspark/sql/dataframe.py in toPandas(self)
2029 raise RuntimeError("%s\n%s" % (_exception_message(e), msg))
2030 else:
-> 2031 pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
2032
2033 dtype = {}
/databricks/spark/python/pyspark/sql/dataframe.py in collect(self)
480 with SCCallSiteSync(self._sc) as css:
481 port = self._jdf.collectToPython()
--> 482 return list(_load_from_socket(port, BatchedSerializer(PickleSerializer())))
483