
I have a single script that:

  1. imports 2 sets of data: df_height['user_id', 'height'], df_age['user_id', 'age']
  2. cleans the data
  3. analyses the data: i) sum(height), ii) mean(age), iii) sum(height) * mean(age)
  4. displays the data.

I want to:

  • Separate the functions out into modules
  • Divide the different analyses into their own 'main'
  • For each analysis, divide into i) import and clean, ii) process, iii) display

Here is the complete script (comments marked with #-> indicate the file each function will be moved to):

import pandas as pd
import numpy as np

#1. functions for import data #-> These functions into src/import_data/import_data.py
def get_data_age():  
    df = pd.DataFrame({
        "user_id":     ['1', '2', '3', '4', '5'], 
        "age":         [10,  20,  30, "55", 50], 
    })
    return df

def get_data_height(): 
    df = pd.DataFrame({
        "user_id":     ['5', '7', '12', '5'], 
        "height":      [160, 170, 180, 'replace_this_with_190']
    })
    return df

 #2. functions for cleaning data #-> These functions into src/clean_data/clean_data.py
def clean_age (df): 
    df['age'] = pd.to_numeric(df['age'])
    return df 

def clean_height (df): 
    df['height'] = df['height'].replace("replace_this_with_190", 190)
    return df 

 #3. functions for processing data #-> These functions into src/alghorithms/calculations.py
def alghorithm_age (df):
    return df['age'].mean()

def alghorithm_height (df):
    return df['height'].sum()

 #4. functions in common (display data) #-> This function into src/display_data/display_data.py
def common_function_display_data (data): 
    print (data)

 #5. function that combines data from alghorithm_height and alghorithm_age #-> This function into src/alghorithms/calculations.py
def product_age_mean_and_height_sum(mean_age, sum_height): 
    return mean_age * sum_height


#main 1 (age)
df_age = get_data_age()    # -> this step into file main_age/00_import_and_clean_age.py
df_age_clean = clean_age(df_age)  # -> this step into file main_age/00_import_and_clean_age.py
age_mean = alghorithm_age(df_age_clean) # -> this step into file main_age/01_process_age.py
common_function_display_data(age_mean) # -> this step into file main_age/02_display_age.py

#main 2 (height)
df_height = get_data_height()# -> this step into file main_height/00_import_and_clean_height.py
df_height_clean = clean_height(df_height)# -> this step into file main_height/00_import_and_clean_height.py
height_sum = alghorithm_height(df_height_clean) # -> this step into file main_height/01_process_height.py
common_function_display_data(height_sum)# -> this step into file main_height/02_display_height.py

#main 3 (combined)
age_mean_height_sum_product = product_age_mean_and_height_sum(age_mean, height_sum) # -> this step into file main_display_combined/display_combined.py
common_function_display_data(age_mean_height_sum_product) # -> this step into file main_display_combined/display_combined.py

Here is the final project structure I had in mind.

[Link: GitHub repo with example]

[Image: project structure]

[Image: data flow]

Problem: When I structure the project as above, I am unable to import modules into the main scripts. I believe this is because they are on parallel levels. I get the following error:

# EXAMPLE for file main_one_age/00_import_and_clean_age.py
---
from ..import_data.import_data import get_data_age
from ..clean_data.clean_data import clean_age

df_age = get_data_age()    # -> this step into file main_age/00_import_and_clean_age.py
df_age_clean = clean_age(df_age)  # -> this step into file main_age/00_import_and_clean_age.py

---
OUT:
    from ..import_data.import_data import get_data_age
ImportError: attempted relative import beyond top-level package
PS C:\Users\leodt\LH_REPOS\src\src>

QUESTIONS

Q: How can I separate the script into modules/main into a common structure?

The current solution doesn't allow me to:

  • place a main within a subfolder, e.g. main_one_age/main_here.py. With this structure the code won't work.
  • run files like import_and_clean_age_00.py as main. If I do this, I get the error:
def display_data_main_one(age_mean):
    return display_data.common_function_display_data(age_mean)
     
if __name__ == '__main__': 
    display_data("path to age mean")
    
 OUT: 
 ModuleNotFoundError: No module named 'display_data'

Q: Can you provide a solution that rewrites "path_for_data_etc.py" into a standard form, and also adds all the setup.py/pyproject.toml etc. that is needed for this to be considered a "completed" project?

I am basically looking for a standard solution that I can then use as a template for my real projects.

Leo

1 Answer


1. How to separate the script into modules/main?

The structure shown in the question is appropriate.

The proper way to run each script is from the src/ directory, using the -m option:

python -m main_one_age.main_one_age_00_and_01_and_02
python -m main_one_age.import_and_clean_age_00
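
To illustrate why the -m form matters, here is a minimal sketch that builds a throwaway copy of the layout in a temporary folder and runs the same file both ways. The file name display_age_02.py and the value 33.0 are placeholders for illustration, not taken from the repo. With -m, Python puts the current directory (src/) on sys.path, so the sibling package display_data is importable. When the file is run directly, Python puts the file's own folder there instead, which reproduces the ModuleNotFoundError from the question.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def build_demo_tree(root: Path) -> None:
    """Create a tiny stand-in for the src/ layout (hypothetical file names)."""
    (root / "display_data").mkdir()
    (root / "display_data" / "__init__.py").write_text("")
    (root / "display_data" / "display_data.py").write_text(
        "def common_function_display_data(data):\n    print(data)\n"
    )
    (root / "main_one_age").mkdir()
    (root / "main_one_age" / "__init__.py").write_text("")
    (root / "main_one_age" / "display_age_02.py").write_text(
        "from display_data.display_data import common_function_display_data\n"
        "common_function_display_data(33.0)\n"
    )


def run(args, cwd):
    """Run a Python subprocess in cwd; return (return code, stdout, stderr)."""
    proc = subprocess.run(
        [sys.executable, *args], cwd=cwd, capture_output=True, text=True
    )
    return proc.returncode, proc.stdout.strip(), proc.stderr


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp)
    build_demo_tree(src)
    # Equivalent of: cd src && python -m main_one_age.display_age_02
    as_module = run(["-m", "main_one_age.display_age_02"], cwd=src)
    # Equivalent of: cd src && python main_one_age/display_age_02.py
    as_script = run([str(src / "main_one_age" / "display_age_02.py")], cwd=src)

print("python -m  -> return code:", as_module[0], "output:", as_module[1])
print("direct run -> return code:", as_script[0])
print("direct run hit ModuleNotFoundError:", "ModuleNotFoundError" in as_script[2])
```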


1.1. How to support running the files directly?

(e.g. with the "play" button in VS Code)

You would need this boilerplate at the top of each of those files:

if __package__ is None:
    # Run directly (not via -m): make the parent folder (src/) importable.
    import sys
    from pathlib import Path
    sys.path.append(str(Path(__file__).resolve().parent.parent))
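
The guard relies on __package__ being None only when a file is run directly. A small sketch to check this, using a hypothetical probe.py inside a temporary main_one_age package (neither exists in the repo):

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# The probe file only reports the value of __package__ it was started with.
PROBE_SOURCE = "print(repr(__package__))\n"


def run_python(args, cwd):
    """Run a Python subprocess in cwd and return its stripped stdout."""
    proc = subprocess.run(
        [sys.executable, *args], cwd=cwd, capture_output=True, text=True
    )
    return proc.stdout.strip()


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp)
    package_dir = src / "main_one_age"
    package_dir.mkdir()
    (package_dir / "__init__.py").write_text("")
    (package_dir / "probe.py").write_text(PROBE_SOURCE)

    # Direct run, as the "play" button would do it.
    direct = run_python([str(package_dir / "probe.py")], cwd=src)
    # Module run from src/, as recommended in section 1.
    as_module = run_python(["-m", "main_one_age.probe"], cwd=src)

print("direct run:", direct)
print("python -m :", as_module)
```

The direct run reports None and the -m run reports 'main_one_age', so the boilerplate only touches sys.path in the case where the imports would otherwise fail.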

1.2. How to name main files?

Name main files as __main__.py, e.g. main_display_combined/__main__.py.
Reference: https://docs.python.org/3/library/__main__.html#main-py-in-python-packages
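
As a rough sketch of what that looks like in practice, the snippet below creates a temporary main_display_combined package with a __main__.py and runs it by package name. The file contents and the inputs 2.0 and 3 are made up for the demonstration; only the package name and product_age_mean_and_height_sum come from the question.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical contents for the two files in the package.
DISPLAY_COMBINED_SOURCE = (
    "def product_age_mean_and_height_sum(mean_age, sum_height):\n"
    "    return mean_age * sum_height\n"
)
MAIN_SOURCE = (
    "from main_display_combined.display_combined import (\n"
    "    product_age_mean_and_height_sum,\n"
    ")\n"
    "print(product_age_mean_and_height_sum(2.0, 3))\n"
)

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp)
    package_dir = src / "main_display_combined"
    package_dir.mkdir()
    (package_dir / "__init__.py").write_text("")
    (package_dir / "display_combined.py").write_text(DISPLAY_COMBINED_SOURCE)
    (package_dir / "__main__.py").write_text(MAIN_SOURCE)

    # Equivalent of: cd src && python -m main_display_combined
    # Naming the package alone is enough; Python executes its __main__.py.
    result = subprocess.run(
        [sys.executable, "-m", "main_display_combined"],
        cwd=src,
        capture_output=True,
        text=True,
    )

print("return code:", result.returncode)
print("output:", result.stdout.strip())
```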

If you want to directly run a file named after its package and importing from its package,
i.e. main_display_combined/main_display_combined.py doing
from main_display_combined import display_combined, you would need:

if __package__ is None:
    import sys
    from pathlib import Path
    package_dir = Path(__file__).resolve().parent
    sys.path.append(str(package_dir.parent))
    # Drop the package folder itself so the package, not this same-named file, is imported.
    if str(package_dir) in sys.path:
        sys.path.remove(str(package_dir))

2. How to rewrite path_for_data_etc.py into a standard form?

The Pythonic way would be:

from pathlib import Path

ROOT_DIR = Path(__file__).resolve().parent

RAW_DATA_DIR = ROOT_DIR / "data" / "raw_data"
age_raw_data_path = RAW_DATA_DIR / "raw_age_data.csv"
height_raw_data_path = RAW_DATA_DIR / "raw_height_data.csv"

CLEAN_DATA_DIR = ROOT_DIR / "data" / "cleaned_data"
age_clean_data_path = CLEAN_DATA_DIR / "clean_age_data.csv"
height_clean_data_path = CLEAN_DATA_DIR / "clean_height_data.csv"
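
A possible way to consume such paths, sketched under some assumptions: a temporary folder stands in for ROOT_DIR, the standard csv module stands in for pandas so the snippet runs without extra installs, and data_paths is a hypothetical helper, not part of the project. The point is that Path objects can create their parent folders and be opened directly.

```python
import csv
import tempfile
from pathlib import Path


def data_paths(root_dir: Path) -> dict:
    """Build raw/clean file locations under root_dir (hypothetical helper)."""
    raw_dir = root_dir / "data" / "raw_data"
    clean_dir = root_dir / "data" / "cleaned_data"
    return {
        "age_raw": raw_dir / "raw_age_data.csv",
        "age_clean": clean_dir / "clean_age_data.csv",
    }


with tempfile.TemporaryDirectory() as tmp:
    paths = data_paths(Path(tmp))

    # Create the folders on first use; exist_ok avoids an error on re-runs.
    for path in paths.values():
        path.parent.mkdir(parents=True, exist_ok=True)

    # Write a small raw file with made-up rows.
    with paths["age_raw"].open("w", newline="") as f:
        csv.writer(f).writerows([["user_id", "age"], ["1", "10"], ["4", "55"]])

    # "Clean" step: read the raw file and cast age from text to int.
    with paths["age_raw"].open(newline="") as f:
        rows = [
            {"user_id": row["user_id"], "age": int(row["age"])}
            for row in csv.DictReader(f)
        ]

    # Save the cleaned rows to the cleaned_data location.
    with paths["age_clean"].open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["user_id", "age"])
        writer.writeheader()
        writer.writerows(rows)

    cleaned_text = paths["age_clean"].read_text()

print(cleaned_text)
```

In the real project the same Path objects can be passed straight to pd.read_csv and DataFrame.to_csv.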

3. setup.py/pyproject.toml for this to be considered a "completed" project?

You just need a requirements.txt:

pandas

To install the requirements:

pip install -r requirements.txt
aaron
  • I suggest `python -m venv`. Yes, include a readme and license. I would further suggest to nest all the Python modules in a `myproject/` folder (name it according to your project, some prefer `src/` especially for other programming languages). – aaron Jan 03 '23 at 12:28
  • "You just need a requirements.txt:" no. Using a `setup.py` and `pyproject.toml` is the correct way to make your project installable – juanpa.arrivillaga Jan 03 '23 at 19:01
  • "You would need this boilerplate at the top of each of those files" *no*. You should properly package and install your project, not mucking around with `sys.path`. – juanpa.arrivillaga Jan 03 '23 at 19:32
  • @juanpa.arrivillaga 1) This is an application, not a library. 2) How else can the file be run directly? Did you read the question and the rest of the answer? – aaron Jan 03 '23 at 23:41