
I wish to make a Python wheel to upload as a spark-submit job in Azure Databricks, but I can't validate that my wheel works. I don't understand where or how a call to the wheel finds the __main__ method.

How do I build the package and/or call the wheel file on the command line in a way that the main method gets run?

Below is a simple package I'm attempting; all it does is echo today's date.

Some of the commands I've tried to run the script:

python dist/today-0.0.1-py3-none-any.whl/
python dist/today-0.0.1-py3-none-any.whl/__main__
python dist/today-0.0.1-py3-none-any.whl/main

I've tried a lot of variations on naming the main file main.py or __main__.py, and naming the method main or __main__, but everything gives me the same error:

C:\Python391\python.exe: can't find '__main__' module in 'C:\<DIRECTORYPATH>\dist\today-0.0.1-py3-none-any.whl'

The package consists of an empty __init__.py and __main__.py, which looks like:

import datetime

def main():
    print(f'Today is {datetime.date.today()}')

if __name__ == '__main__':
    main()

My directory structure is:

Wheeltest
  |-- setup.py
  |-- today
       |-- __init__.py
       |-- __main__.py

I've unzipped the wheel file and can confirm that it has the folder today at the top level, with the two .py files inside it. My setup.py file looks like this (I've also tried without the entry_points section):

from setuptools import setup
from setuptools import find_packages

VERSION = '0.0.1'
DESCRIPTION = 'today package.'
LONG_DESCRIPTION = 'today dist.'

# Setting up
setup(
    name='today',
    version=VERSION,
    author='Simon Norton',
    author_email='<xxxxxxxx@yyyyy.com>',
    description=DESCRIPTION,
    long_description=LONG_DESCRIPTION,
    packages=find_packages(),
    entry_points={
        'console_scripts': ['main=today.__main__:main']
    },
    classifiers=['Development Status :: Testing',
        'Programming Language :: Python :: 3',
        'Operating System :: Microsoft :: Windows',
        'Operating System :: Linux']
)
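For reference, a wheel like the dist\today-0.0.1-py3-none-any.whl used above can be built from this setup.py with, for example:

python setup.py bdist_wheel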

Many thanks!

Simon Norton
  • "how a call to the wheel finds the __main__ method" you probably mean `main` function as that's what you have in your code. The answer is it won't. The wheel file is a built distribution file, intended to be installed. It's not standalone. Usual use for it is as a lib, not an app. If you want to make it an app you need to wrap it with something that includes it and launches the entry point explicitly. Maybe a shell script or https://pypi.org/project/py2exe/ Naming your function `main` is just convention. Naming a file `__main__` see https://stackoverflow.com/questions/4042905/what-is-main-py – Davos Nov 18 '21 at 17:48

1 Answer


The path to the wheel file must contain the top-level folder inside the wheel for Python to find __main__. In my example, the top-level folder is called "today":

C:\<DIRECTORYPATH>\dist\today-0.0.1-py3-none-any.whl\today
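This works because a wheel is just a zip archive, and Python can import pure-Python code from zip files. A sketch of an equivalent invocation (paths assume the layout from the question): put the wheel itself on the module search path and run the package with -m:

set PYTHONPATH=dist\today-0.0.1-py3-none-any.whl
python -m today

Note this only holds for pure-Python wheels; wheels containing compiled extensions cannot be imported straight from the zip.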
Simon Norton
  • The whl file is a zip file plus a manifest. Your directory structure includes a package `today`, so when you run `python C:\<DIRECTORYPATH>\dist\today-0.0.1-py3-none-any.whl\today` you are just benefiting from Python being able to import from zip files. If you include certain dependencies in your wheel, e.g. those with platform-specific C libs, this won't work. A wheel is not ideal for spark-submit unless it is just a simple build. Databricks Jobs API 2.1 has a new `python_wheel_task`. The wheel would be installed (e.g. `pip install`) and the task has `package_name` and `entry_point` attributes (see the first sketch after these comments). – Davos Nov 18 '21 at 17:32
  • The other ways to run wheels in Databricks all involve installing the wheel as a library on your cluster first. You can then use a Python script task or notebook task to import your library and run the entry point of your choosing. I'm guessing the new `python_wheel_task` will just create the dummy script.py that imports and runs your library for you (second sketch below). Check out the possible `tasks` in the API docs https://docs.databricks.com/dev-tools/api/latest/jobs.html#operation/JobsCreate – Davos Nov 18 '21 at 17:36
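As a hedged sketch of what such a python_wheel_task job definition might look like (the job name, cluster id and DBFS path are placeholders, not values from the question; check the Jobs 2.1 docs linked above for the exact schema):

# Hypothetical Jobs API 2.1 job spec using python_wheel_task;
# all concrete values below are made-up placeholders.
job_spec = {
    "name": "run-today-wheel",
    "tasks": [
        {
            "task_key": "today",
            "existing_cluster_id": "<CLUSTER_ID>",
            "python_wheel_task": {
                "package_name": "today",  # distribution name from setup.py
                "entry_point": "main",    # console_scripts name from entry_points
            },
            "libraries": [
                {"whl": "dbfs:/FileStore/wheels/today-0.0.1-py3-none-any.whl"}
            ],
        }
    ],
}

And a sketch of the dummy launcher script for the library-plus-script-task route, assuming the today wheel is already installed on the cluster as a library:

# script.py: submitted as a Databricks Python script task; assumes the
# 'today' wheel is already installed on the cluster.
from today.__main__ import main

if __name__ == '__main__':
    main()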