I am trying to spin up on Python and PySpark. I followed this page on installing and checking PySpark in Anaconda on Windows. The following checking code works:

>>> import findspark
>>> findspark.init()
>>> findspark.find()
'C:\\Users\\User.Name\\anaconda3\\envs\\py39\\lib\\site-packages\\pyspark'

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName('SparkExamples.com').getOrCreate()
>>> data = [("Java","20000"), ("Python","100000"), ("Scala","3000")]
>>> columns = ["language","users_count"]
>>> df = spark.createDataFrame(data).toDF(*columns)
>>> df.show()
+--------+-----------+
|language|users_count|
+--------+-----------+
|    Java|      20000|
|  Python|     100000|
|   Scala|       3000|
+--------+-----------+

I tried accessing the online help for the methods createDataFrame and toDF. Getting help on createDataFrame was straightforward: help(spark.createDataFrame).

I haven't been able to access the online help for toDF:

>>> help(spark.toDF)
AttributeError: 'SparkSession' object has no attribute 'toDF'

>>> help(DataFrame.toDF)
NameError: name 'DataFrame' is not defined

>>> help(spark.DataFrame.toDF)
AttributeError: 'SparkSession' object has no attribute 'DataFrame'

>>> help(DataFrame)
NameError: name 'DataFrame' is not defined

>>> help(spark.DataFrame)
AttributeError: 'SparkSession' object has no attribute 'DataFrame'

(1) How is the documentation accessed?

(2) Is there a scheme for accessing the help that one can infer based on the checking code above?

user2153235
  • Command-line `help` is for the `PySpark` command-line documentation, not for the Python Spark API. To see Spark API related help, check the official documentation here: https://spark.apache.org/docs/latest/api/python/index.html and for command line help try `>>> --help` – Vikramsinh Shinde Jul 25 '23 at 08:53
  • @Vikramsinh Shinde: I'm confused. The cited page says that PySpark *is* a Python API into Spark. Can you please clarify? Thanks. – user2153235 Jul 25 '23 at 15:53

1 Answer

You need to import the `DataFrame` class from `pyspark.sql`:

>>> from pyspark.sql import DataFrame
>>> help(DataFrame.toDF)
"""
Help on function toDF in module pyspark.sql.dataframe:

toDF(self, *cols)
    Returns a new class:`DataFrame` that with new specified column names

    :param cols: list of new column names (string)

    >>> df.toDF('f1', 'f2').collect()
    [Row(f1=2, f2='Alice'), Row(f1=5, f2='Bob')]
"""
enamya
  • Thank you, *enamya*. To understand your answer, I read up on packages, modules, and different import syntaxes. I then found the `pyspark` package and `sql` subpackage in `sys.path`. The module `dataframe.py` therein contains the `toDF` method. I confirmed that the file was correct using `import inspect; inspect.getfile(DataFrame)`, which yields `C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark\sql\dataframe.py`. – user2153235 Jul 25 '23 at 19:20
  • The filename stem `dataframe`, however, does not match the *case* of the imported name `DataFrame`, and my web searching indicates that Python is case sensitive. Curiously, `dataframe.py` contains a `DataFrame` *class* for which the `toDF` method is defined. My understanding was that `from pyspark.sql import ...` *should* only be able to import the submodule `pyspark/sql/dataframe.py` (i.e., `from pyspark.sql import dataframe`) and that the `DataFrame` class should then be prefixed with the module name, i.e., `help(dataframe.DataFrame.toDF)`. – user2153235 Jul 25 '23 at 20:31
  • How is it possible that I can directly import the `DataFrame` class without mentioning the module name `dataframe`, and that I can then directly access class `DataFrame` without the module name `dataframe` as a prefix? – user2153235 Jul 25 '23 at 20:32
  • I wonder if the explanation is that `pyspark/sql/__init__.py` contains the initialization `from pyspark.sql.dataframe import DataFrame, ...`. It doesn't fully explain why `from pyspark.sql import DataFrame` works: the initialization puts `DataFrame` in the package's namespace, but it doesn't make `DataFrame` into a module within the `pyspark.sql` subpackage (see the sketch below). – user2153235 Jul 25 '23 at 22:24
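
That is indeed the mechanism. `from package import name` first looks for `name` as an attribute of the package (and importing the package runs its `__init__.py`, which binds that attribute); only if the attribute is missing does Python fall back to looking for a submodule of that name. A minimal sketch with a hypothetical package `mypkg` mirroring the `pyspark.sql` layout:

mypkg/
    __init__.py     # contains: from mypkg.inner import Thing
    inner.py        # contains: class Thing: ...

>>> from mypkg import Thing    # works: __init__.py bound Thing as an attribute of mypkg
>>> from mypkg import inner    # also works: inner is a genuine submodule
>>> Thing is inner.Thing       # the same class object either way
True

So `DataFrame` never becomes a module; it is a class re-exported as an attribute of the `pyspark.sql` package, and `from ... import ...` accepts either.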