Separate script's functions into modules, callable by 2 separate mains

Question

I have a single script that:

imports 2 sets of data: df_height['user_id', 'height'], df_age['user_id', 'age']
cleans the data
analyses the data: i) sum(height), ii) mean(age), iii) sum(height) * mean(age)
displays the data.

I want to:

Separate the functions out into modules
Divide the different analyses into their own 'main'
For each analysis, divide the work into i) import and clean, ii) process, iii) display

Here is the complete script (in the comments with #-> I indicate in what folder the function will be moved to):

import pandas as pd
import numpy as np

#1. functions for import data #-> These functions into src/import_data/import_data.py
def get_data_age():  
    df = pd.DataFrame({
        "user_id":     ['1', '2', '3', '4', '5'], 
        "age":         [10,  20,  30, "55", 50], 
    })
    return df

def get_data_height(): 
    df = pd.DataFrame({
        "user_id":     ['5', '7', '12', '5'], 
        "height":      [160, 170, 180, 'replace_this_with_190']
    })
    return df

#2. functions for cleaning data #-> These functions into src/clean_data/clean_data.py
def clean_age (df): 
    df['age'] = pd.to_numeric(df['age'])
    return df 

def clean_height (df): 
    df['height'] = df['height'].replace("replace_this_with_190", 200)
    return df 

#3. functions for processing data #-> These functions into src/alghorithms/calculations.py
def alghorithm_age (df):
    return df['age'].mean()

def alghorithm_height (df):
    return df['height'].sum()

#4. functions in common (display data) #-> This function into src/display_data/display_data.py
def common_function_display_data (data): 
    print (data)

#5. function that combines data from alghorithm_height and alghorithm_age #-> This function into src/alghorithms/calculations.py
def product_age_mean_and_height_sum(mean_age, sum_height): 
    return mean_age * sum_height


#main 1 (age)
df_age = get_data_age()    # -> this step into file main_age/00_import_and_clean_age.py
df_age_clean = clean_age(df_age)  # -> this step into file main_age/00_import_and_clean_age.py
age_mean = alghorithm_age(df_age_clean) # -> this step into file main_age/01_process_age.py
common_function_display_data(age_mean) # -> this step into file main_age/02_display_age.py

#main 2 (height)
df_height = get_data_height()# -> this step into file main_height/00_import_and_clean_height.py
df_height_clean = clean_height(df_height)# -> this step into file main_height/00_import_and_clean_height.py
height_sum = alghorithm_height(df_height_clean) # -> this step into file main_height/01_process_height.py
common_function_display_data(height_sum)# -> this step into file main_height/02_display_height.py

#main 3 (combined)
age_mean_height_sum_product = product_age_mean_and_height_sum(age_mean, height_sum) # -> this step into file main_display_combined/display_combined.py
common_function_display_data(age_mean_height_sum_product)# -> this step into file main_height/02_display_height.py

Here is the final project structure I had in mind.

github repo with example

Problem: When I structure the project as above, I am unable to import modules into the main scripts. I believe this is because they are on parallel levels. I get the following error:

# EXAMPLE for file main_one_age/00_import_and_clean_age.py
---
from ..import_data.import_data import get_data_age
from ..clean_data.clean_data import clean_age

df_age = get_data_age()    # -> this step into file main_age/00_import_and_clean_age.py
df_age_clean = clean_age(df_age)  # -> this step into file main_age/00_import_and_clean_age.py

---
OUT:
    from ..import_data.import_data import get_data_age
ImportError: attempted relative import beyond top-level package
PS C:\Users\leodt\LH_REPOS\src\src>

QUESTIONS

Q: How can I separate the script into modules/main into a common structure?

The current solution doesn't allow me to:

place a main within a subfolder, e.g. main_one_age/main_here.py. With this structure the code won't work.
run files like import_and_clean_age_00.py as main. If I do this, I get the error:

def display_data_main_one(age_mean):
    return display_data.common_function_display_data(age_mean)
     
if __name__ == '__main__': 
    display_data("path to age mean")
    
 OUT: 
 ModuleNotFoundError: No module named 'display_data'

Q: Can you provide a solution that rewrites "path_for_data_etc.py" into a standard form, and also adds all the setup.py/pyproject.toml etc. that is needed for this to be considered a "completed" project?

Basically looking for a standard solution that I can then use as a template for my real projects.

For now I am just running the scripts with the "play" button in VS Code. Any working solution that you can suggest is OK — Leo, Jan 02 '23 at 14:02
mucking around with `sys.path` is absolutely the wrong way to go about this. — juanpa.arrivillaga, Jan 03 '23 at 19:33

aaron · Accepted Answer · 2023-01-02T16:23:13.060

1. How to separate the script into modules/main?

The structure shown in the question is appropriate.

The proper way to run each script is from the src/ directory, using the -m option:

python -m main_one_age.main_one_age_00_and_01_and_02
python -m main_one_age.import_and_clean_age_00
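To see why this works, here is a minimal sketch that builds a throwaway copy of the layout in a temporary folder and runs one module with -m from src/. It assumes the main files use absolute imports such as `from import_data.import_data import get_data_age` (the question's `..` relative imports cannot work here, because the main_one_age folder is itself the top-level package when run from src/). The file contents are simplified stand-ins, not the real project code.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src"
    (src / "import_data").mkdir(parents=True)
    (src / "main_one_age").mkdir()

    # Empty __init__.py files mark both folders as regular packages.
    (src / "import_data" / "__init__.py").write_text("")
    (src / "main_one_age" / "__init__.py").write_text("")

    # Simplified stand-in for src/import_data/import_data.py.
    (src / "import_data" / "import_data.py").write_text(
        "def get_data_age():\n"
        "    return [10, 20, 30]\n"
    )

    # The main file imports absolutely, relative to src/ (assumption).
    (src / "main_one_age" / "import_and_clean_age_00.py").write_text(
        "from import_data.import_data import get_data_age\n"
        "print(get_data_age())\n"
    )

    # Equivalent to: cd src && python -m main_one_age.import_and_clean_age_00
    completed = subprocess.run(
        [sys.executable, "-m", "main_one_age.import_and_clean_age_00"],
        cwd=src, capture_output=True, text=True, check=True,
    )
    module_output = completed.stdout.strip()

print(module_output)
```

With -m, Python puts the current working directory (src/) on sys.path, so every sibling package is importable by its plain name.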

1.1. How to support running the files directly?

(e.g. with the "play" button in VS Code)

You would need this boilerplate at the top of each of those files:

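if __package__ is None:
    import sys  # import sys directly rather than through pathlib
    from pathlib import Path
    # Make src/ (the parent of this file's folder) importable.
    sys.path.append(str(Path(__file__).resolve().parent.parent))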

1.2. How to name main files?

Name main files as __main__.py, e.g. main_display_combined/__main__.py.
Reference: https://docs.python.org/3/library/__main__.html#main-py-in-python-packages
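As a rough illustration of the `__main__.py` convention, the sketch below creates a temporary main_display_combined package and runs it by package name. The `product` helper and the numbers are made up for the demo; only the file and folder names follow the answer.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "src" / "main_display_combined"
    pkg.mkdir(parents=True)
    (pkg / "__init__.py").write_text("")

    # Hypothetical helper module inside the package.
    (pkg / "display_combined.py").write_text(
        "def product(mean_age, sum_height):\n"
        "    return mean_age * sum_height\n"
    )

    # __main__.py is what runs when the package itself is executed.
    (pkg / "__main__.py").write_text(
        "from main_display_combined.display_combined import product\n"
        "print(product(2, 3))\n"
    )

    # Equivalent to: cd src && python -m main_display_combined
    completed = subprocess.run(
        [sys.executable, "-m", "main_display_combined"],
        cwd=pkg.parent, capture_output=True, text=True, check=True,
    )
    main_output = completed.stdout.strip()

print(main_output)
```

The command names only the package, so there is no need to repeat a long file name on the command line.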

If you want to directly run a file named after its package and importing from its package,
i.e. main_display_combined/main_display_combined.py doing
from main_display_combined import display_combined, you would need:

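if __package__ is None:
    import sys  # import sys directly rather than through pathlib
    from pathlib import Path
    # Make src/ (the parent of this file's folder) importable.
    sys.path.append(str(Path(__file__).resolve().parent.parent))
    # Drop this file's own folder so the package, not this file, is imported.
    if str(Path(__file__).resolve().parent) in sys.path:
        sys.path.remove(str(Path(__file__).resolve().parent))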

2. How to rewrite path_for_data_etc.py into a standard form?

The Pythonic way would be:

from pathlib import Path

ROOT_DIR = Path(__file__).resolve().parent

RAW_DATA_DIR = ROOT_DIR / "data" / "raw_data"
age_raw_data_path = RAW_DATA_DIR / "raw_age_data.csv"
height_raw_data_path = RAW_DATA_DIR / "raw_height_data.csv"

CLEAN_DATA_DIR = ROOT_DIR / "data" / "cleaned_data"
age_clean_data_path = CLEAN_DATA_DIR / "clean_age_data.csv"
height_clean_data_path = CLEAN_DATA_DIR / "clean_height_data.csv"
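A short sketch of how such path constants might be used. To stay dependency-free and runnable anywhere, it points ROOT_DIR at a temporary folder instead of `Path(__file__).resolve().parent` and reads the file with the standard csv module; in the real project the same Path object could be passed to `pd.read_csv`. The sample rows are invented.

```python
import csv
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for Path(__file__).resolve().parent (assumption for the demo).
    ROOT_DIR = Path(tmp)
    RAW_DATA_DIR = ROOT_DIR / "data" / "raw_data"
    age_raw_data_path = RAW_DATA_DIR / "raw_age_data.csv"

    # Create the folder tree if it does not exist yet.
    RAW_DATA_DIR.mkdir(parents=True, exist_ok=True)

    # Write a tiny sample file so there is something to read back.
    with age_raw_data_path.open("w", newline="") as f:
        csv.writer(f).writerows([["user_id", "age"], ["1", "10"], ["2", "20"]])

    # Path objects can be opened directly, with no string concatenation.
    with age_raw_data_path.open(newline="") as f:
        rows = list(csv.DictReader(f))

ages = [int(row["age"]) for row in rows]
print(ages)
```

Because every path is derived from ROOT_DIR, the scripts behave the same regardless of the directory they are launched from.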

3. setup.py/pyproject.toml for this to be considered a "completed" project?

You just need a requirements.txt:

pandas

To install the requirements:

pip install -r requirements.txt

I suggest `python -m venv`. Yes, include a readme and license. I would further suggest nesting all the Python modules in a `myproject/` folder (name it according to your project; some prefer `src/`, especially for other programming languages). — aaron, Jan 03 '23 at 12:28
"You just need a requirements.txt:" no. Using a `setup.py` and `pyproject.toml` is the correct way to make your project installable — juanpa.arrivillaga, Jan 03 '23 at 19:01
"You would need this boilerplate at the top of each of those files" *no*. You should properly package and install your project, not mucking around with `sys.path`. — juanpa.arrivillaga, Jan 03 '23 at 19:32
@juanpa.arrivillaga 1) This is an application, not a library. 2) How else can the file be run directly? Did you read the question and the rest of the answer? — aaron, Jan 03 '23 at 23:41