
I'm new to Spark and trying to use it the way I've used pandas for data analysis.

In pandas, to see a variable, I would write the following:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
print(df.head())

In Spark, my print statements are not printed to the terminal. Based on David's comment on this answer, print statements are sent to stdout/stderr, and there is a way to get at them with YARN, but he doesn't say how. I can't find anything that makes sense by Googling "how to capture stdout spark".

What I want is a way to see bits of my data to troubleshoot my data analysis. "Did adding that column work?" That sort of thing. I'd also welcome new ways to troubleshoot that are better for huge datasets.


1 Answer


Yes, there are a few different ways to print your DataFrame:

>>> l = [[1, 1], [2, 2], [3, 3], [4, 4], [5, 5]]

>>> spark.createDataFrame(l, ['a', 'b']).show()
+---+---+
|  a|  b|
+---+---+
|  1|  1|
|  2|  2|
|  3|  3|
|  4|  4|
|  5|  5|
+---+---+

>>> print(spark.createDataFrame(l, ['a', 'b']).limit(5).toPandas())
   a  b
0  1  1
1  2  2
2  3  3
3  4  4
4  5  5

df.show() prints the top 20 rows by default, but you can pass a number n to print the first n rows instead.
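
For instance, reusing the l list from above:

>>> spark.createDataFrame(l, ['a', 'b']).show(2)
+---+---+
|  a|  b|
+---+---+
|  1|  1|
|  2|  2|
+---+---+
only showing top 2 rows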

You can also use df.limit(n).toPandas() to get a pandas-style equivalent of df.head(). Calling .limit(n) before .toPandas() matters: .toPandas() collects the result to the driver, so limiting first keeps that collection small.
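
For the "did adding that column work?" kind of check from your question, a minimal sketch would be something like the following (the column name a_plus_b is just a made-up example):

>>> from pyspark.sql import functions as F
>>> df = spark.createDataFrame(l, ['a', 'b'])
>>> df = df.withColumn('a_plus_b', F.col('a') + F.col('b'))  # derive a new column
>>> df.select('a_plus_b').show(3)  # peek at just the new column
+--------+
|a_plus_b|
+--------+
|       2|
|       4|
|       6|
+--------+
only showing top 3 rows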
