
I'm looking for the name for a procedure which handles output from one function in several others (trying to find better words for my problem). Some pseudo/actual code would be really helpful.

I have written the following code:

def read_data():
    read data from a file
    create df
    return df

def parse_data():
    sorted_df = read_data()
    count lines
    sort by date
    return sorted_df

def add_new_column(): 
    new_column_df = parse_data()
    add new column
    return new_column_df

def create_plot():
    plot_data = add_new_column()
    create a plot
    display chart

What I'm trying to understand is how to skip a function, e.g. create the following chain: read_data() -> parse_data() -> create_plot().

As the code looks right now (due to all return values and how they are passed between functions) it requires me to change input data in the last function, create_plot().

I suspect that I'm creating logically incorrect code.

Any thoughts?

Original code:

import pandas as pd
import matplotlib.pyplot as plt

# Read csv files in to data frame
def read_data():
    raw_data = pd.read_csv('C:/testdata.csv', sep=',', engine='python', encoding='utf-8-sig').replace({'{':'', '}':'', '"':'', ',':' '}, regex=True)
    return raw_data

def parse_data(parsed_data):
    ...
    # Convert CreationDate column into datetime
    raw_data['CreationDate'] = pd.to_datetime(raw_data['CreationDate'], format='%Y-%m-%d %H:%M:%S', errors='coerce')
    raw_data.sort_values(by=['CreationDate'], inplace=True, ascending=True)
    parsed_data = raw_data
    return parsed_data

raw_data = read_files()
parsed = parsed_data(raw_data)
  • Use *function parameters*, e.g. `def parse_data(data)`. Instead of having `parse_data` call `read_data`, pass that data from one to the other: `parse_data(read_data())`. This way each function is independent and you can chain them flexibly. – deceze Oct 20 '19 at 14:05
  • https://docs.python.org/3/tutorial/controlflow.html#defining-functions – wwii Oct 20 '19 at 14:07

2 Answers


Pass the data in instead of just effectively "nesting" everything. Any data that a function requires should ideally be passed in to the function as a parameter:

def read_data():
    read data from a file
    create df
    return df

def parse_data(sorted_df):
    count lines
    sort by date
    return sorted_df

def add_new_column(new_column_df):
    add new column
    return new_column_df

def create_plot(plot_data):  
    create a plot
    display chart

df = read_data()
parsed = parse_data(df)
added = add_new_column(parsed)
create_plot(added)

Try to make sure functions are only handling what they're directly responsible for. It isn't parse_data's job to know where the data is coming from or to produce the data, so it shouldn't be worrying about that. Let the caller handle that.
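
As a rough, runnable sketch of the same structure: the version below uses plain lists of dicts in place of a DataFrame so that it runs without pandas, and the sample data, the `Weekday` column and the "plot" that only returns axis labels are all made up for illustration. The point is only that skipping a step becomes the caller's decision.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample data standing in for the CSV file from the question.
SAMPLE_CSV = """CreationDate,Value
2019-10-20 14:05:00,3
2019-10-18 09:30:00,1
2019-10-19 12:00:00,2
"""

def read_data(text):
    """Read CSV text into a list of row dicts (a stand-in for a DataFrame)."""
    return list(csv.DictReader(io.StringIO(text)))

def parse_data(rows):
    """Convert CreationDate to datetime and return the rows sorted by it."""
    parsed = [
        {**row, "CreationDate": datetime.strptime(row["CreationDate"], "%Y-%m-%d %H:%M:%S")}
        for row in rows
    ]
    return sorted(parsed, key=lambda row: row["CreationDate"])

def add_new_column(rows):
    """Return new rows with an extra, purely illustrative 'Weekday' column."""
    return [{**row, "Weekday": row["CreationDate"].strftime("%A")} for row in rows]

def create_plot(rows):
    """Stand-in for plotting: return the labels that would go on the x-axis."""
    return [row["CreationDate"].strftime("%Y-%m-%d") for row in rows]

# Full chain: every step receives its input as an argument.
full = create_plot(add_new_column(parse_data(read_data(SAMPLE_CSV))))

# Skipping add_new_column only changes the caller; create_plot is untouched.
skipped = create_plot(parse_data(read_data(SAMPLE_CSV)))
```

Because `create_plot` no longer calls `add_new_column` itself, both chains work without editing any function body.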

The way I have things set up here is often referred to as "piping" or "threading". Information "flows" from one function into the next. In a language like Clojure, this could be written as:

(-> (read-data)
    (parse-data)
    (add-new-column)
    (create-plot))

This uses the threading macro ->, which frees you from having to pass the data along manually. Unfortunately, Python doesn't have anything built in to do this, although it can be achieved using external modules.
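
For a feel of what such a helper might look like, here is a minimal sketch built on `functools.reduce`. The `pipe` function and the two toy steps are hypothetical names written for this example, not part of any library; pandas itself also offers `DataFrame.pipe` for chaining functions over a frame.

```python
from functools import reduce

def pipe(value, *functions):
    """Feed value through each function in order, left to right."""
    return reduce(lambda acc, func: func(acc), functions, value)

# Hypothetical stand-ins for the real processing steps.
def parse(numbers):
    return sorted(numbers)

def add_total(numbers):
    return numbers + [sum(numbers)]

# Steps are just arguments, so leaving one out skips it.
with_total = pipe([3, 1, 2], parse, add_total)
without_total = pipe([3, 1, 2], parse)
```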


Also note that since dataframes seem to be mutable, you don't actually need to return the altered ones from the functions. If you're just mutating the argument directly, you could pass the same data frame to each of the functions in order instead of placing it in intermediate variables like parsed and added. The way I'm showing here is a general way to set things up, but it can be altered depending on your exact use case.
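
A small sketch of that mutate-in-place variant, again using a plain list of dicts rather than a DataFrame (the `date` and `year` keys are invented for the example). It only works when each function really modifies its argument; many pandas methods, such as `sort_values`, return a new object unless `inplace=True` is passed.

```python
def parse_data(rows):
    """Sort the list in place; nothing is returned."""
    rows.sort(key=lambda row: row["date"])

def add_new_column(rows):
    """Add a 'year' key to every row in place; nothing is returned."""
    for row in rows:
        row["year"] = row["date"][:4]

# The same object is handed to each step, with no intermediate variables.
data = [{"date": "2019-10-20"}, {"date": "2018-01-05"}]
parse_data(data)
add_new_column(data)
```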

Carcigenicate
  • I'm trying to redo the code, but facing some issues with `local variable sorted_df referenced before assignment`. I'll mark your answer as the solution, I think I'm just too tired right now. – Tore Djerberg Oct 20 '19 at 15:51
  • @ToreDjerberg I would need to see the code causing that error to be able to help with that. – Carcigenicate Oct 20 '19 at 15:53
  • I think the issue is in the `parse_data` function. When parsing the data I use following code snippet (pseudo): `def parse_data(): df['Column1'] = df['Column1'].convert_to_date`. When executing, `df` is marked as referenced before assignment. – Tore Djerberg Oct 20 '19 at 16:02
  • @ToreDjerberg Note that I showed that that data needs to be passed in using a parameter. Review my `parse_data` function again and note how I'm calling it. – Carcigenicate Oct 20 '19 at 16:08
  • @ToreDjerberg Just a typo. You create a variable called `data` (`data = read_files()`), but then you try to refer to it as `raw_data` instead (`parse_data(raw_data)`). Just use `data`: `parse_data(data)`. – Carcigenicate Oct 20 '19 at 16:22
  • I'm getting too tired I guess. Fixing the typo doesn't solve the issue. I'm getting `UnboundLocalError: local variable 'raw_data' referenced before assignment` which is pointing to first mention of raw_data inside parse_data. – Tore Djerberg Oct 20 '19 at 16:30
  • @ToreDjerberg Same type of issue. You call the parameter `parsed_data`, but then you try to use it as `raw_data`. – Carcigenicate Oct 20 '19 at 16:32
  • @ToreDjerberg You might just need to take a step back and practice using function parameters so they make more sense, then come back to this problem. – Carcigenicate Oct 20 '19 at 16:34

Use a class to contain your code:

class DataManipulation:
    def __init__(self, path):
        self.df = pd.DataFrame()
        self.read_data(path)

    @staticmethod
    def new(file_path):
        return DataManipulation(file_path)

    def read_data(self, path):
        read data from a file
        self.df = create df

    def parse_data(self):
        use self.df
        count lines
        sort by date
        return self

    def add_new_column(self):
        use self.df
        add new column
        return self

    def create_plot(self):
        use self.df
        create a plot
        display chart
        return self

And then,

d = DataManipulation.new(filepath).parse_data().add_new_column().create_plot()
Vishnudev Krishnadas
  • Why `.new`? Just `DataManipulation(filepath)` will do the same thing. It's not really a great idea to use a class whose methods must be called in a specific order. Without calling `parse_data`, the rest of the methods won't do anything. Explicitly returning `self` all the time is… questionable. If you want a fluent interface, it's okay, but this doesn't seem like a good use case for a fluent interface. – deceze Oct 20 '19 at 15:25
  • There are different design patterns in Python. One of them is the builder pattern, which is quite popular. This answer is inspired by that pattern. In the actual pattern, when you call `new()` the function returns a builder object which hides the attributes and adds functionality. The class itself doesn't specify any order. It has functions that return `self` to implement chaining. If you need an example, [pandas itself uses it](https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L3071). @deceze @Tore Djerberg – Vishnudev Krishnadas Oct 20 '19 at 16:10