
I'm trying to plot a Spark dataset using matplotlib, after converting it to a pandas DataFrame, in AWS EMR JupyterHub.

I'm able to plot in a single cell using matplotlib like below:

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

df = [1, 1.6, 3, 4.2, 5, 4, 2.5, 3, 1.5]
plt.plot(df)

Now the above code snippet works pretty neatly for me.

After this sample example, I moved on to plotting my pandas DataFrame across multiple cells in AWS EMR JupyterHub, like this:

-Cell 1-
sparkDS=spark.read.parquet('s3://bucket_name/path').cache()


-Cell 2-
from pyspark.sql.functions import *
sparkDS_groupBy=sparkDS.groupBy('col1').agg(count('*').alias('count')).orderBy('col1')
pandasDF=sparkDS_groupBy.toPandas()


-cell 3-
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

plt.plot(pandasDF)

My code just fails in cell 3 with the following error:

NameError: name 'pandasDF' is not defined

Does anyone have any idea what's wrong?

Why is the new cell in my JupyterHub notebook not able to recognize a variable from the previous cell?

Does it have something to do with the '%matplotlib inline' magic command (I tried '%matplotlib notebook' as well, but that failed too)?

PS: I'm using an AWS EMR 5.19 JupyterHub notebook setup for my plotting work.

This error is kind of similar to this one, but it's not a duplicate: How do I make matplotlib work in AWS EMR Jupyter notebook?

Aman Mundra
  • I can't see you defining ```pandasDF``` anywhere in the above code. Is there some code you're not showing us? – jwalton Jun 09 '19 at 16:52
  • I'm converting spark dataset to pandas dataframe in the second cell third line, like this: pandasDF=sparkDS_groupBy.toPandas() – Aman Mundra Jun 09 '19 at 17:04
  • Try to simplify the problem. Can you print the dataframe? If not, remove matplotlib from the equation. Can you use a python list instead of a dataframe to get the same error? If so, remove pandas from the problem, etc. etc. – ImportanceOfBeingErnest Jun 09 '19 at 17:06
  • Printing the dataframe itself gives an error. It seems like matplotlib has an issue with the PySpark kernel; in the Python kernel it runs fine – Aman Mundra Jun 09 '19 at 17:11

1 Answer


You'll want to look into the %%spark -o df_name and %%local magics; you can see their documentation by typing %%help in a cell.

Specifically, in your case try:

  1. Use %%spark -o sparkDS_groupBy at the start of -Cell 2-,
  2. Start -Cell 3- with %%local,
  3. And plot sparkDS_groupBy in -Cell 3- instead of pandasDF.
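
Applied to the cells in your question, the result might look like the sketch below. This is an illustration, not something tested against your data: the -o flag tells sparkmagic to copy the named Spark DataFrame into the local notebook session as a pandas DataFrame, so the explicit toPandas() call is no longer needed, and I'm plotting the 'count' column on the assumption that 'col1' may not be numeric.

-Cell 2-
%%spark -o sparkDS_groupBy
from pyspark.sql.functions import count
# runs on the cluster; -o also copies sparkDS_groupBy into the local
# notebook session as a pandas DataFrame, so toPandas() is not needed
sparkDS_groupBy = sparkDS.groupBy('col1').agg(count('*').alias('count')).orderBy('col1')

-Cell 3-
%%local
%matplotlib inline
import matplotlib.pyplot as plt

# runs locally; here sparkDS_groupBy is a pandas.DataFrame
plt.plot(sparkDS_groupBy['count'])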

For those with less context: you can get plots by running the following in an EMR Notebook using the PySpark kernel, attached to an EMR cluster that's at least release 5.26.0 (which introduced notebook-scoped libraries).

(each code block represents a Cell)

%%help

%%configure -f
{
  "conf": {
    "spark.pyspark.python": "python3",
    "spark.pyspark.virtualenv.enabled": "true",
    "spark.pyspark.virtualenv.type": "native",
    "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv"
  }
}

sc.install_pypi_package("matplotlib")

%%spark -o my_df
# in this cell, my_df is a pyspark.sql.DataFrame
my_df = spark.read.text("s3://.../...")

%%local
%matplotlib inline

import matplotlib.pyplot as plt

# in this cell, my_df is a pandas.DataFrame
plt.plot(my_df)
yegeniy
  • Note that there may be a limitation to the `-o` flag. It seems that only the last `-o` flag's value is respected per `%%spark`. If that's the case, just use more than one `%%spark` declaration. – yegeniy Sep 10 '19 at 19:49
  • technically, `matplotlib` is available by default, so installing it is unnecessary. – yegeniy Sep 10 '19 at 19:57
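
A minimal sketch of the workaround described in the first comment above, assuming that limitation of the -o flag holds; df_a and df_b are hypothetical DataFrames used only for illustration:

-Cell A-
%%spark -o df_a
# one %%spark cell per -o output; df_a is copied to the local session as pandas
df_a = spark.range(10).toDF('x')

-Cell B-
%%spark -o df_b
# a second %%spark cell so the -o flag is applied to df_b as well
df_b = spark.range(10, 20).toDF('y')

-Cell C-
%%local
# both df_a and df_b are now available locally as pandas DataFrames
print(df_a.head())
print(df_b.head())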