I'm trying to plot a Spark dataset with matplotlib after converting it to a pandas DataFrame in JupyterHub on AWS EMR.
Plotting works when everything is in a single cell, like this:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
df = [1, 1.6, 3, 4.2, 5, 4, 2.5, 3, 1.5]
plt.plot(df)
The snippet above works fine for me.
After this simple example, I tried to plot my pandas DataFrame with the code split across multiple cells in the same AWS EMR JupyterHub notebook, like this:
-Cell 1-
sparkDS=spark.read.parquet('s3://bucket_name/path').cache()
-Cell 2-
from pyspark.sql.functions import count
sparkDS_groupBy = sparkDS.groupBy('col1').agg(count('*').alias('count')).orderBy('col1')
pandasDF = sparkDS_groupBy.toPandas()
-Cell 3-
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.plot(pandasDF)
Cell 3 fails with the following error:
NameError: name 'pandasDF' is not defined
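To be explicit about what I expect cell 3 to do once pandasDF is visible, here is a minimal sketch that runs in a plain local Python session. The DataFrame contents are hypothetical stand-in data (my real data comes from the parquet file above), and the Agg backend is only there so the sketch runs outside a notebook; the column names col1 and count match the aggregation in cell 2.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so this runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the result of sparkDS_groupBy.toPandas()
pandasDF = pd.DataFrame({"col1": [1, 2, 3], "count": [10, 25, 7]})

# Plot the per-group counts against the grouping column
fig, ax = plt.subplots()
ax.plot(pandasDF["col1"], pandasDF["count"], marker="o")
ax.set_xlabel("col1")
ax.set_ylabel("count")
```

Here I select the two columns explicitly instead of passing the whole DataFrame to plt.plot, so that col1 ends up on the x-axis rather than being drawn as a second line.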
Does anyone know what's wrong?
Why can't a new cell in my JupyterHub notebook see a variable defined in a previous cell?
Does it have something to do with the '%matplotlib inline' magic command (I also tried '%matplotlib notebook', but that failed too)?
PS: I'm using the JupyterHub notebook setup on AWS EMR 5.19 for my plotting work.
This error is similar to the one in the following question, but it is not a duplicate: How do I make matplotlib work in AWS EMR Jupyter notebook?