I am trying to write some PySpark scripts in PyCharm. While I found multiple explanations of how to connect them (such as How to link PyCharm with PySpark?), not everything works properly.
What I did is basically set the environment variables correctly:
echo $PYTHONPATH
:/usr/local/spark/python:/usr/local/spark/python/lib/py4j-0.9-src.zip
echo $SPARK_HOME
/usr/local/spark
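In case it matters, the variables are set roughly like this (a sketch; the paths and py4j version match the echo output above):

```shell
# Environment setup (e.g. in ~/.bashrc); paths match the echo output above
export SPARK_HOME=/usr/local/spark
export PYTHONPATH=$PYTHONPATH:$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.9-src.zip
```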
and in the code I have:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

appName = "demo1"
master = "local"
conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
The problem is that many DataFrame aggregation functions are flagged as errors by PyCharm's inspector. For example, I have the following lines:
from pyspark.sql import functions as agg_funcs
maxTimeStamp = base_df.agg(agg_funcs.max(base_df.time)).collect()
Yet PyCharm claims: Cannot find reference 'max' in functions.py. A similar error appears for most aggregate functions (e.g. col, count).
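As far as I can tell, this looks like an inspection issue rather than a runtime one: in pyspark.sql.functions many names (max, count, etc.) appear to be generated dynamically at import time, which a static analyzer cannot see. A minimal illustration of that pattern in plain Python (no Spark required; the helper names here are made up for the sketch):

```python
# Illustration: names injected into a module's namespace at runtime.
# A static inspector cannot resolve such names, even though they
# work fine when the code actually executes.

def _make_agg(name):
    def agg(col):
        # Stand-in for building a real Column expression
        return f"{name}({col})"
    agg.__name__ = name
    return agg

# Dynamically create 'max', 'min', 'count' in the module namespace,
# roughly the way pyspark.sql.functions builds its functions.
for _name in ["max", "min", "count"]:
    globals()[_name] = _make_agg(_name)

print(max("time"))  # runs fine, yet no 'def max' exists for static analysis
```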
How would I fix this?