
Every month I receive a CSV file with some set of columns. Regardless of which columns I receive, I should output a CSV with columns C1, C2, C3, ..., C29, C30 if possible, plus a log file with the steps I took.

I know that the order of my data transformations should be t1, t2, t3, t4, t5.

t1 generates columns C8, C9, C12, C22 using C1, C2, C3, C4
t2 generates columns C10, C11, C17 using C3, C6, C7, C8
t3 generates columns C13, C14, C15, C16 using C5, C8, C10, C11, C22
t4 generates columns C18, C19, C20, C21, C23, C24, C25 using C13, C15
t5 generates columns C26, C27, C28, C29, C30 using C5, C19, C20, C21
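
To summarize, the dependency structure above can also be written down as plain data; the dict below is just an illustration, not my actual code:

# Illustration only: the transform dependencies above, expressed as data
TRANSFORMS = {
    "t1": {"inputs": ["C1", "C2", "C3", "C4"], "outputs": ["C8", "C9", "C12", "C22"]},
    "t2": {"inputs": ["C3", "C6", "C7", "C8"], "outputs": ["C10", "C11", "C17"]},
    "t3": {"inputs": ["C5", "C8", "C10", "C11", "C22"], "outputs": ["C13", "C14", "C15", "C16"]},
    "t4": {"inputs": ["C13", "C15"], "outputs": ["C18", "C19", "C20", "C21", "C23", "C24", "C25"]},
    "t5": {"inputs": ["C5", "C19", "C20", "C21"], "outputs": ["C26", "C27", "C28", "C29", "C30"]},
}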

I cannot control what columns I get in my input data.

If my input data has columns C1, C2, C3, C4, C5, C6, C7, I can generate all of the C1 ... C30 columns.

If my input data has columns C1, C2, C3, C4, C5, C6, C7, C8, C10, C11, C17, I can still generate all of the C1 ... C30 columns, but I should skip t2, as it is not necessary.

If my input data has C1, C2, C3, C4, C6, C7, I can only do t1, t2, t3, t4. I cannot run t5, so I should create the C26, C27, C28, C29, C30 columns with NaN values only, and add to the log: "Cannot perform t5 transformation because C5 is missing. C26, C27, C28, C29, C30 are filled with NaN values"

My t1, t2, t3, t4, t5 are already created, but I don't know how to organize the code in an elegant manner so that code repetition is minimal.

I had to develop my code in a very short amount of time. Consequently, all my t1, t2, t3, t4, t5 methods look like this:

def ti(df):
    output_cols = get_output_cols()
    if output_cols_already_exist(df, output_cols):
        return df, "{} skipped, the output cols {} already exist".format(inspect.stack()[0][3], output_cols)
    else:
        input_cols = get_required_input_cols()
        missing_cols = get_missing_cols(df, input_cols)
        if missing_cols == []:
            # do stuff
            log = "Performed {} transformation. Created {} columns".format(inspect.stack()[0][3], output_cols)
        else:
            for col in output_cols:
                df[col] = np.NaN
            log = "Cannot perform {} transformation because {} columns are missing. {} are filled with NaN values".format(inspect.stack()[0][3], missing_cols, output_cols)
        return df, log

Also, I use the functions in the following way:

text = ""
df = pd.read_csv(input_path)
df, log_text = t1(df)
text = text + log_text + "\n"
df, log_text = t2(df)
text = text + log_text + "\n"
df, log_text = t3(df)
text = text + log_text + "\n"
df, log_text = t4(df)
text = text + log_text + "\n"
df, log_text = t5(df)
text = text + log_text + "\n"
df.to_csv("output_data.csv", index = False)
logging.info(text)

As you can see, my code is ugly and repetitive. Now I have time to refactor it, but I don't know what would be the best approach. I also want my code to be extensible, as I am thinking about adding a t6 transform. Can you help me by giving some directions / design patterns I could follow? (I am also open to using other Python libraries beyond pandas.)

pentavol

1 Answer


Since functions are first-class objects in Python, you could refactor your code to generalize your t[i] functions by extracting what differentiates them (the "do stuff" part), making it a helper function and treating it as a parameter.

You can also avoid repetition when calling the functions (either t1, t2, etc. or the refactored versions below) by iterating over a list.

Lastly, f-strings help make your code a little more readable.

Something like this:

# t function takes a dataframe and a function as parameters
def t(df, do_stuff_func):
    # Use the helper's name so the log identifies which transformation ran
    name = do_stuff_func.__name__
    output_cols = get_output_cols()
    if output_cols_already_exist(df, output_cols):
        return (
            df,
            f"{name} skipped, the output cols {output_cols} already exist",
        )
    else:
        input_cols = get_required_input_cols()
        missing_cols = get_missing_cols(df, input_cols)
        if missing_cols == []:
            # Call the helper function (it adds the new columns to df)
            do_stuff_func(df)
            log = (
                f"Performed {name} transformation. "
                f"Created {output_cols} columns"
            )
        else:
            for col in output_cols:
                df[col] = np.NaN
            log = (
                f"Cannot perform {name} transformation "
                f"because {missing_cols} columns are missing. "
                f"{output_cols} are filled with NaN values"
            )
        return df, log

# Define the five new 'do_stuff' functions; each receives df and adds its columns
def do_stuff1(df):
    pass
...
def do_stuff5(df):
    pass

# Store the functions
do_stuff_funcs = [do_stuff1, do_stuff2, do_stuff3, do_stuff4, do_stuff5]

# Call t function in combination with df and do_stuff_funcs helpers
text = ""
for do_stuff_func in do_stuff_funcs:
    df, log_text = t(df, do_stuff_func)
    text = text + log_text + "\n"

# Save the results
df.to_csv("output_data.csv", index = False)
logging.info(text)
Laurent
  • Thanks for the answer. While I like your solution much more than mine, I keep asking myself whether it can be done better. I try to follow best practices in terms of structuring the code and I don't know whether this approach is good enough or not. More precisely, shouldn't I create any classes? Shouldn't I turn the "t" function into some kind of decorator for the "do_stuff" functions? – pentavol May 02 '21 at 10:37
  • Classes are especially useful when you want to keep track of the state of an object. Otherwise, it's mainly a way to organize your code (see for instance https://stackoverflow.com/a/33072722/11246056). Decorators are just syntactic sugar, they make sense if you have lots of functions for which you need to apply the same logic (see for instance https://stackoverflow.com/a/52593993/11246056). In your case, unless you expect more do_stuff functions, I would keep things simple and avoid both. – Laurent May 02 '21 at 11:55
  • All my `t1`, `t2`, `t3`, `t4`, `t5` functions modify the same dataframe `df`. In this context I was thinking about classes, but I am not an expert. I will probably need to add just 2-3 more do_stuff methods. I don't know whether that number is high enough to think about the decorator "syntactic sugar" or not. At first glance, a decorator should be OK for my problem (it gives me the flexibility of removing "t" from a do_stuff function). I will wait for 1-2 more days to see if other answers appear. If not, I will accept yours as it is obviously an improvement over my previous code. – pentavol May 02 '21 at 12:09
  • Thanks. As a final note, I recommend this "Guide to design patterns" by @Brandon Rhodes: https://python-patterns.guide/ – Laurent May 02 '21 at 17:02
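
For reference, here is a minimal sketch of the decorator idea discussed in the comments above; the column lists, the placeholder t1 body, and the tiny example dataframe are illustrative assumptions, not code from the question or the answer:

import functools
import logging

import numpy as np
import pandas as pd


def transform(input_cols, output_cols):
    """Wrap a 'do stuff' function with the shared skip / run / fill-with-NaN logic."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(df):
            name = func.__name__
            # Skip the transform if its outputs are already present
            if all(col in df.columns for col in output_cols):
                return df, f"{name} skipped, the output cols {output_cols} already exist"
            missing = [col for col in input_cols if col not in df.columns]
            if not missing:
                df = func(df)
                return df, f"Performed {name} transformation. Created {output_cols} columns"
            # Inputs are missing: fill the outputs with NaN and log why
            for col in output_cols:
                df[col] = np.nan
            return df, (
                f"Cannot perform {name} transformation because {missing} columns "
                f"are missing. {output_cols} are filled with NaN values"
            )
        return wrapper
    return decorator


@transform(input_cols=["C1", "C2", "C3", "C4"], output_cols=["C8", "C9", "C12", "C22"])
def t1(df):
    # Placeholder 'do stuff' body for t1
    return df.assign(C8=0, C9=0, C12=0, C22=0)


# t2 ... t5 would be declared the same way; adding a t6 later just means
# writing one more decorated function and appending it to this list.
transforms = [t1]

df = pd.DataFrame({"C1": [1], "C2": [2], "C3": [3], "C4": [4]})
log_lines = []
for t_func in transforms:
    df, log_text = t_func(df)
    log_lines.append(log_text)
df.to_csv("output_data.csv", index=False)
logging.info("\n".join(log_lines))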