
Pandas API function: https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.to_string.html

Another answer, though it doesn't work for me: "pyspark : Convert DataFrame to RDD[string]"

Following the advice in that post, I tried:

data.rdd.map(lambda row: [str(c) for c in row])

Then I get this error:

TypeError: 'PipelinedRDD' object is not iterable

I would like it to output the rows as strings, similar to what to_string() above produces. Is this possible?

1 Answer


Would `pyspark.sql.DataFrame.show` give you the console output you expect? If needed, you can sort the DataFrame with `pyspark.sql.DataFrame.sort` before printing.

helios
  • Thanks, but that isn't what I'm looking for. – user10443249 Dec 19 '22 at 07:39
  • There seems to be no direct method on PySpark SQL DataFrames to extract the string output format you are looking for. However, a simple `print(data.pandas_api().to_string())` does the job (in your desired format), since pandas-on-Spark DataFrames and Spark DataFrames are virtually interchangeable (see the [official documentation](https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/pandas_pyspark.html)). I see no benefit in trying to convert the data to low-level RDDs instead of sticking to the high-level APIs. – helios Dec 19 '22 at 19:35
  • Does this solve your issue, @user10443249? If so, please mark the answer as useful. – helios Dec 20 '22 at 19:50
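
For anyone who still wants the RDD route from the question: `rdd.map(...)` is lazy and returns another RDD (a `PipelinedRDD`), which cannot be looped over directly. That is the likely source of the `TypeError`. The result has to be brought back to the driver first, for example with `collect()`. Below is a minimal sketch of that idea, not a definitive implementation. The helper names `row_to_line` and `dataframe_to_lines` and the two-space separator are made up for illustration, and it assumes the data is small enough to fit in driver memory.

```python
def row_to_line(row, sep="  "):
    """Join the string form of every field in one row (hypothetical helper)."""
    return sep.join(str(c) for c in row)


def dataframe_to_lines(df, sep="  "):
    """Return one string per row of a Spark DataFrame (hypothetical helper).

    Assumption: the result is small enough to collect to the driver.
    """
    # map() only describes the transformation and returns an RDD;
    # collect() runs it and returns a plain Python list of strings.
    return df.rdd.map(lambda row: row_to_line(row, sep)).collect()
```

With the `data` DataFrame from the question, this could be used as `print("\n".join(dataframe_to_lines(data)))`. Unlike `to_string()`, it does not align columns or print a header. For larger results, `toLocalIterator()` in place of `collect()` avoids loading every row at once.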