I am running PySpark on AWS EMR, in a Jupyter notebook served by JupyterHub on the cluster. I have read data into a Spark DataFrame named clusters_df, and I'm trying to create a simple line chart with k on the x axis and score on the y axis. Since I don't think Spark has built-in data visualization, I tried converting the DataFrame to a pandas DataFrame. When I try to display the chart in the notebook, I get the messages below. I've also tried matplotlib. Both code examples are below, with the output each one returns. Can anyone suggest how to create a line chart in a Jupyter notebook running PySpark on EMR?

libraries imported:

import pyspark
##### running on emr
## function to create all tables
from pyspark.sql.types import *
from pyspark.context import SparkContext
from pyspark.sql import Window
from pyspark.sql import SQLContext
from pyspark import SparkConf
from pyspark.sql import SparkSession

from pyspark.sql.functions import col
from pyspark.sql.functions import first
import pyspark.sql.functions as func
from pyspark.sql.functions import lit,StringType,coalesce,lag,trim, upper, substring
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.functions import round, explode,row_number,udf, length, min, when, format_number
from pyspark.sql.functions import  hour, year, month, dayofmonth, date_add, to_date,datediff,dayofyear, weekofyear, date_format, unix_timestamp

from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.feature import MinMaxScaler, PCA
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
from pyspark.ml.feature import StandardScaler


import traceback
import sys
import time
import math
import datetime
import numpy as np
import pandas as pd

Update: To clarify, the two code examples below are two separate attempts to create a line chart in a Jupyter notebook running Spark on EMR. Both fail to produce a chart.

The pandas example just returns the text shown. The matplotlib example returns the error shown, because the cell doesn't seem to recognize Spark variables anymore once the magic command for matplotlib is run in that cell.

importing dataframe:

clusters_df=sqlContext.read.parquet("path")

code:

clusters_df.toPandas().plot.line(x="k",y="score");

output:

<AxesSubplot:xlabel='k'>

code:

%matplotlib inline

import matplotlib.pyplot as plt

pnds_df=clusters_df.toPandas()

plt.plot(pnds_df['k'],pnds_df['score'])

plt.show()

output:

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-33-5e7649bc56fb> in <module>
      3 import matplotlib.pyplot as plt
      4 
----> 5 pnds_df=clusters_df.toPandas()
      6 
      7 plt.plot(pnds_df['k'],pnds_df['score'])

NameError: name 'clusters_df' is not defined
user3476463
  • Does this answer your question? [Python NameError: name is not defined](https://stackoverflow.com/questions/14804084/python-nameerror-name-is-not-defined) – Michael Delgado Aug 13 '22 at 18:51
  • I'm actually confused by this question. Did you mean to show a different traceback? A NameError seems like a pretty straightforward issue to me. – Michael Delgado Aug 13 '22 at 18:51
  • @MichaelDelgado thank you for getting back to me. I added a little more explanation to my original post. I'm trying to display a line chart in a jupyter notebook running spark basically. I tried it two different ways and they both fail. so that's why there's a pandas example and a matplotlib example. – user3476463 Aug 15 '22 at 00:21
  • OK, in the first one, don't use a semicolon. You want the returned plot to actually be returned and displayed. In the second one... I dunno, did you restart the kernel? Unless you did something super weird to your environment, I don't see how `%matplotlib inline` could spin off its own process. It looks like you just need to make sure to execute your imports and definitions before running the pandas example. – Michael Delgado Aug 15 '22 at 00:28
  • @MichaelDelgado I've run it without the semicolon, and it returns the same message. This is a Jupyter notebook running PySpark on EMR, not the usual plain Python with pandas. Spark doesn't have good data viz capabilities. – user3476463 Aug 15 '22 at 03:02
  • But the axes subplot object is the plot. Did you run matplotlib inline before that cell? Unfortunately, this seems like an environment issue. The code you’ve written should produce a plot (and in fact it is producing a plot) - not sure how to help if it’s not displaying. Have you messed with the matplotlib backend defaults or something? Can you save the figure? – Michael Delgado Aug 15 '22 at 03:46
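
The last comment asks whether the figure can be saved. A minimal sketch of that check follows, assuming pandas and matplotlib are installed wherever the cell actually executes. The data in pnds_df is made up and stands in for clusters_df.toPandas(); the helper name save_line_chart and the output file name are hypothetical, not from the post.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: renders to files, needs no display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for clusters_df.toPandas(); these values are invented.
pnds_df = pd.DataFrame(
    {"k": [2, 3, 4, 5, 6], "score": [0.61, 0.58, 0.52, 0.49, 0.47]}
)


def save_line_chart(df, path=None):
    """Plot score against k and write the figure to a PNG file."""
    if path is None:
        # Assumed output location: the system temp directory.
        path = os.path.join(tempfile.gettempdir(), "k_vs_score.png")
    fig, ax = plt.subplots()
    # Sort by k so the line is drawn left to right.
    df.sort_values("k").plot.line(x="k", y="score", ax=ax)
    ax.set_xlabel("k")
    ax.set_ylabel("score")
    fig.savefig(path)
    plt.close(fig)  # release the figure once it is on disk
    return path


saved_path = save_line_chart(pnds_df)
print(saved_path)
```

If the file is written and opens as a chart, the plotting code is fine and the problem is limited to how the notebook displays figures. One possible explanation for the NameError, which the post does not confirm, is that the notebook kernel forwards ordinary cells to the cluster but runs a cell that begins with a magic such as %matplotlib inline in a separate local Python process, where clusters_df was never defined.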

0 Answers