I am running PySpark on AWS EMR, in a Jupyter notebook hosted on JupyterHub on the cluster. I have read data into a Spark DataFrame named clusters_df, and I'm trying to create a simple line chart with k on the x-axis and score on the y-axis. Since I don't think Spark has built-in data visualization, I tried converting the DataFrame to a pandas DataFrame and plotting that. When I try to display the chart in the notebook, I get the output shown below. I've also tried matplotlib directly. Both code examples are below, with the output each one returns. Can anyone suggest how to create a line chart in a Jupyter notebook running PySpark on EMR?
libraries imported:
import pyspark
##### running on emr
## function to create all tables
from pyspark.sql.types import *
from pyspark.context import SparkContext
from pyspark.sql import Window
from pyspark.sql import SQLContext
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.functions import first
import pyspark.sql.functions as func
from pyspark.sql.functions import lit,StringType,coalesce,lag,trim, upper, substring
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.functions import round, explode,row_number,udf, length, min, when, format_number
from pyspark.sql.functions import hour, year, month, dayofmonth, date_add, to_date,datediff,dayofyear, weekofyear, date_format, unix_timestamp
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.feature import MinMaxScaler, PCA
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
from pyspark.ml.feature import StandardScaler
import traceback
import sys
import time
import math
import datetime
import numpy as np
import pandas as pd
Update: To clarify, the two code examples below are two separate attempts at creating a line chart in a Jupyter notebook running Spark on EMR, and neither produces a chart.
The pandas example just returns the text shown. The matplotlib example returns the error shown, because the cell doesn't seem to recognize Spark anymore once the magic command that enables matplotlib is run in it.
importing dataframe:
clusters_df=sqlContext.read.parquet("path")
code:
clusters_df.toPandas().plot.line(x="k",y="score");
output:
<AxesSubplot:xlabel='k'>
code:
%matplotlib inline
import matplotlib.pyplot as plt
pnds_df=clusters_df.toPandas()
plt.plot(pnds_df['k'],pnds_df['score'])
plt.show()
output:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-33-5e7649bc56fb> in <module>
3 import matplotlib.pyplot as plt
4
----> 5 pnds_df=clusters_df.toPandas()
6
7 plt.plot(pnds_df['k'],pnds_df['score'])
NameError: name 'clusters_df' is not defined
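A note on the pandas example above: as far as I can tell, the text it returns is just the repr of the matplotlib Axes object that plot.line hands back, so a plot object is created but the notebook never renders it. Below is a minimal sketch of that behaviour outside Spark. The k and score values are made-up placeholders (not the real clusters_df data), the name pdf is hypothetical, and the Agg backend is assumed so the code runs without a display. It renders the figure to PNG bytes explicitly instead of relying on the notebook to show it.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import pandas as pd

# Placeholder values standing in for clusters_df.toPandas(); not real data.
pdf = pd.DataFrame({"k": [2, 3, 4, 5], "score": [0.61, 0.55, 0.48, 0.44]})

# plot.line returns a matplotlib Axes; its repr is the text the notebook echoed.
ax = pdf.plot.line(x="k", y="score")
line = ax.get_lines()[0]

# Render the figure explicitly to PNG bytes held in memory.
fig = ax.get_figure()
buf = io.BytesIO()
fig.savefig(buf, format="png")
png_bytes = buf.getvalue()
```

If this runs, the chart itself is fine and the open question is only how to get the EMR notebook front end to display the rendered image.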