
I want to structure a Python repo with multiple Spark applications, where each application is separate. I want to be able to have some common packages that all of the others can use, and some packages that are standalone Spark applications.

I need to be able to build each package separately into a wheel file, both the common packages and the standalone Spark applications.

I also want to have separate test files for each of these packages.

Is the following structure a good practice?

root
├── common_package_a
│   ├── package_a_tests
│   ├── requirements.txt
│   ├── venv
│   ├── setup.py
├── common_package_b
│   ├── package_b_tests
│   ├── requirements.txt
│   ├── venv
│   ├── setup.py
│   .
│   .
│   .
├── spark_application_a
│   ├── spark_application_a_tests
│   ├── requirements.txt
│   ├── venv
│   ├── setup.py
├── spark_application_b
│   ├── spark_application_b_tests
│   ├── requirements.txt
│   ├── venv
│   ├── setup.py
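
To be concrete, the idea is that every such directory has its own minimal setup.py and is built into a wheel on its own. A rough sketch of what I have in mind (names and versions here are only illustrative):

# common_package_a/setup.py
from setuptools import find_packages, setup

setup(
    name="common_package_a",
    version="0.1.0",
    packages=find_packages(exclude=["package_a_tests"]),
)

Running python setup.py bdist_wheel inside common_package_a would then produce a wheel for just that package.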

I can't find a recommended structure for this goal; all the examples of how to build a Python project have a single setup.py in the root dir and a single venv for the entire project.

I've looked at some questions similar to mine:

  1. https://discuss.python.org/t/how-to-best-structure-a-large-project-into-multiple-installable-packages/5404/2
  2. How do you organise a python project that contains multiple packages so that each file in a package can still be run individually?

Thanks!

1 Answer

I ended up doing the following:

Structure:

root
├── common_project_a
│   ├── package_a_src
│   ├── package_a_tests
│   ├── requirements.txt
│   ├── venv
│   ├── setup.py
├── common_project_b
│   ├── package_b_src
│   ├── package_b_tests
│   ├── requirements.txt
│   ├── venv
│   ├── setup.py
│   .
│   .
│   .
├── spark_application_a
│   ├── spark_application_a_tests
│   ├── requirements.txt
│   ├── venv
│   ├── setup.py
├── spark_application_b
│   ├── spark_application_b_tests
│   ├── requirements.txt
│   ├── venv
│   ├── setup.py

Build process: In the setup.py of spark_application_a I use the following:

import os
from distutils.dir_util import copy_tree
from shutil import rmtree

from setuptools import setup

# Copy the common package's sources next to this setup.py so that
# they get bundled into spark_application_a's wheel.
src_path = os.environ["PWD"] + "/../common_project_a/package_a_src/"
dst_path = "./package_a_src"
copy_tree(src_path, dst_path)

setup(
    # ... name, version, packages, etc. ...
)

# Remove the temporary copy once the build is done.
rmtree(dst_path)
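
For completeness, here is roughly what the setup() call needs so that the copied directory ends up inside the wheel (a sketch; the metadata values are illustrative rather than the ones from my real file, and package_a_src is assumed to contain an __init__.py):

from setuptools import find_packages, setup

setup(
    name="spark_application_a",
    version="0.1.0",
    # find_packages() also discovers the temporarily copied ./package_a_src,
    # so the common code gets bundled into the application's wheel
    packages=find_packages(exclude=["spark_application_a_tests"]),
)

Building the wheel is then the usual python setup.py bdist_wheel run from inside spark_application_a.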

All projects in the same PyCharm window: In PyCharm I open each module as its own project, i.e. I first open common_project_a, then open spark_application_a as a second project and use the "Attach" option to open it in the same window.

The right venv for each project: In Preferences -> Project -> Python Interpreter I choose, for each of the opened projects, the venv that I created under it.

Add project dependencies: To make everything work locally in PyCharm I add dependencies between the projects: in Preferences -> Project -> Project Dependencies, I mark, for each project, which projects it depends on.

This way the imports in spark_application_a work the same both in the built wheel and when running inside PyCharm.
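
For example (module and function names here are just placeholders), the application code can simply do:

# works both from the built wheel and when running inside PyCharm
from package_a_src.some_module import some_function

because package_a_src sits at the top level in both cases: as an attached project dependency in PyCharm and as a copied package inside the wheel.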
