I'm spinning up on both Python and PySpark. I followed this page on installing PySpark in Anaconda on Windows. I tried to get online help on the DataFrame class and its toDF method. From this explanation, the required import (and subsequent help commands) are:
from pyspark.sql import DataFrame # User import command
help(DataFrame)
help(DataFrame.toDF)
The code works, but I don't understand why, even after reading extensively on packages, modules, and initialization (e.g., here, here, and here).
The DataFrame class is defined in the pyspark package, sql subpackage, module file dataframe.py. The file pyspark/sql/__init__.py contains this initialization code:
# __init__.py import command
from pyspark.sql.dataframe import DataFrame, DataFrameNaFunctions, DataFrameStatFunctions
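To make sure I'm describing the structure correctly, I reproduced the same layout with a throwaway toy package (the names mypkg, sub, mod, and Thing are my own, purely for illustration; this is just a sketch of what I think the layout is):

import os
import sys
import tempfile

# Build a temporary package that mirrors the pyspark/sql layout:
#   mypkg/__init__.py          (empty)
#   mypkg/sub/mod.py           defines class Thing (stand-in for DataFrame)
#   mypkg/sub/__init__.py      re-exports Thing, like pyspark/sql/__init__.py
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "mypkg", "sub"))
open(os.path.join(root, "mypkg", "__init__.py"), "w").close()
with open(os.path.join(root, "mypkg", "sub", "mod.py"), "w") as f:
    f.write("class Thing:\n    pass\n")
with open(os.path.join(root, "mypkg", "sub", "__init__.py"), "w") as f:
    f.write("from mypkg.sub.mod import Thing\n")

sys.path.insert(0, root)
from mypkg.sub import Thing   # works, just like from pyspark.sql import DataFrame
help(Thing)

So the toy package behaves the same way as PySpark does.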
I see how this __init__.py import command puts the DataFrame class into the current namespace. For the User import command at the top to run, however, DataFrame must appear like a module in the pyspark.sql subpackage, and I don't see how the __init__.py import command accomplishes this.
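To be concrete about what puzzles me, here is a quick sanity check I ran (assuming only that pyspark is importable in the current environment):

import pyspark.sql                  # fine: sql is a real subpackage
import pyspark.sql.dataframe        # fine: dataframe.py is a real module
# import pyspark.sql.DataFrame      # fails with ModuleNotFoundError: DataFrame is not a module
from pyspark.sql import DataFrame   # yet this works
print(type(DataFrame))              # it's a class, not a module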
Can someone explain, point to a key passage in one of my cited resources, and/or refer me to other information?