I have a single script that:
- imports 2 sets of data: df_height['user_id', 'height'], df_age['user_id', 'age']
- cleans the data
- analyses the data: i) sum(height), ii) mean(age), iii) sum(height) * mean(age)
- displays the data.
I want to:
- Separate the functions out into modules
- Divide the different analyses into their own 'main'
- For each analysis, divide it into i) import and clean, ii) process, iii) display
Here is the complete script (in the comments marked with #-> I indicate which folder each function will be moved to):
import pandas as pd
import numpy as np
#1. functions for import data #-> These functions into src/import_data/import_data.py
def get_data_age():
    df = pd.DataFrame({
        "user_id": ['1', '2', '3', '4', '5'],
        "age": [10, 20, 30, "55", 50],
    })
    return df

def get_data_height():
    df = pd.DataFrame({
        "user_id": ['5', '7', '12', '5'],
        "height": [160, 170, 180, 'replace_this_with_190']
    })
    return df
#2. functions for cleaning data #-> These functions into src/clean_data/clean_data.py
def clean_age(df):
    df['age'] = pd.to_numeric(df['age'])
    return df

def clean_height(df):
    df['height'] = df['height'].replace("replace_this_with_190", 200)
    return df
#3. functions for processing data #-> These functions into src/alghorithms/calculations.py
def alghorithm_age(df):
    return df['age'].mean()

def alghorithm_height(df):
    return df['height'].sum()
#4. functions in common (display data) #-> This function into src/display_data/display_data.py
def common_function_display_data(data):
    print(data)

#5. function that combines data from alghorithm_height and alghorithm_age #-> This function into src/alghorithms/calculations.py
def product_age_mean_and_height_sum(mean_age, sum_height):
    return mean_age * sum_height
#main 1 (age)
df_age = get_data_age()  # -> this step into file main_age/00_import_and_clean_age.py
df_age_clean = clean_age(df_age)  # -> this step into file main_age/00_import_and_clean_age.py
age_mean = alghorithm_age(df_age_clean)  # -> this step into file main_age/01_process_age.py
common_function_display_data(age_mean)  # -> this step into file main_age/02_display_age.py
#main 2 (height)
df_height = get_data_height()  # -> this step into file main_height/00_import_and_clean_height.py
df_height_clean = clean_height(df_height)  # -> this step into file main_height/00_import_and_clean_height.py
height_sum = alghorithm_height(df_height_clean)  # -> this step into file main_height/01_process_height.py
common_function_display_data(height_sum)  # -> this step into file main_height/02_display_height.py
#main 3 (combined)
age_mean_height_sum_product = product_age_mean_and_height_sum(age_mean, height_sum)  # -> this step into file main_display_combined/display_combined.py
common_function_display_data(age_mean_height_sum_product)  # -> this step into file main_display_combined/display_combined.py
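To make the three-stage split concrete, here is a rough sketch of what I am after for main 1 (age), written as three functions in a single file for brevity. The function names import_and_clean_age, process_age and display_age are placeholders I made up for this sketch; in the real project each stage would live in its own file and call the shared modules instead of inlining the data.

```python
import pandas as pd


def import_and_clean_age() -> pd.DataFrame:
    """Stage i: load the raw age data and coerce the 'age' column to numbers."""
    df = pd.DataFrame({
        "user_id": ["1", "2", "3", "4", "5"],
        "age": [10, 20, 30, "55", 50],  # "55" is a string on purpose
    })
    df["age"] = pd.to_numeric(df["age"])
    return df


def process_age(df: pd.DataFrame) -> float:
    """Stage ii: reduce the cleaned data to a single number (the mean age)."""
    return float(df["age"].mean())


def display_age(age_mean: float) -> None:
    """Stage iii: show the result."""
    print(age_mean)


if __name__ == "__main__":
    # Each stage only depends on the return value of the previous one.
    display_age(process_age(import_and_clean_age()))
```

The point of the sketch is only the hand-off: each stage takes the previous stage's return value, so the stages could be moved into separate files without changing their signatures.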
Here is the final project structure I had in mind.
Problem: when I structure the project as above, I am unable to import modules into the main scripts. I believe this is because they are on parallel levels. I get the following error:
# EXAMPLE for file main_one_age/00_import_and_clean_age.py
---
from ..import_data.import_data import get_data_age
from ..clean_data.clean_data import clean_age
df_age = get_data_age() # -> this step into file main_age/00_import_and_clean_age.py
df_age_clean = clean_age(df_age) # -> this step into file main_age/00_import_and_clean_age.py
---
OUT:
from ..import_data.import_data import get_data_age
ImportError: attempted relative import beyond top-level package
PS C:\Users\leodt\LH_REPOS\src\src>
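For reference, the same message can be reproduced outside my project with a throwaway layout. This is only a sketch under assumptions: the folder names mirror mine, get_data_age is stubbed out, the helper name reproduce_relative_import_error is made up, and I am assuming the file is launched with python -m from the folder that contains both packages, which may not be exactly how my own run was started.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def reproduce_relative_import_error() -> str:
    """Build a temporary copy of the layout and return stderr of the failing run."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)

        # Sibling package 1: the module we want to import from (stubbed).
        (root / "import_data").mkdir()
        (root / "import_data" / "__init__.py").write_text("")
        (root / "import_data" / "import_data.py").write_text(
            "def get_data_age():\n    return [10, 20, 30]\n"
        )

        # Sibling package 2: the 'main' that uses a two-dot relative import.
        (root / "main_one_age").mkdir()
        (root / "main_one_age" / "__init__.py").write_text("")
        (root / "main_one_age" / "import_and_clean_age_00.py").write_text(
            "from ..import_data.import_data import get_data_age\n"
            "print(get_data_age())\n"
        )

        # Launched from the folder holding both packages, main_one_age is
        # itself the top-level package, so '..' has no parent to resolve to.
        result = subprocess.run(
            [sys.executable, "-m", "main_one_age.import_and_clean_age_00"],
            cwd=root,
            capture_output=True,
            text=True,
        )
        return result.stderr


if __name__ == "__main__":
    print(reproduce_relative_import_error())
```

In this reproduction the two folders are siblings with no shared parent package, which matches my guess that the "parallel levels" are what the relative import cannot cross.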
QUESTIONS
Q: How can I separate the script into modules/main into a common structure?
The current solution doesn't allow me to:
- place a main within a subfolder, e.g. main_one_age/main_here.py. With this structure the code won't work.
- run files like import_and_clean_age_00.py as main. If I do this, I get the error:
def display_data_main_one(age_mean):
    return display_data.common_function_display_data(age_mean)

if __name__ == '__main__':
    display_data("path to age mean")
OUT:
ModuleNotFoundError: No module named 'display_data'
Q: Can you provide a solution that rewrites "path_for_data_etc.py" into a standard form, and also adds everything (setup.py, pyproject.toml, etc.) that is needed for this to be considered a "completed" project?
Basically, I am looking for a standard solution that I can then use as a template for my real projects.