Converting an imperative algorithm that "grows" a table into pure functions

Question

My program, written in Python 3, has many places where it starts with a (very large) table-like numeric data structure and adds columns to it following a certain algorithm. (The algorithm is different in every place.)

I am trying to convert this into pure functional approach since I run into problems with the imperative approach (hard to reuse, hard to memoize interim steps, hard to achieve "lazy" computation, bug-prone due to reliance on state, etc.).

The Table class is implemented as a dictionary of dictionaries: the outer dictionary contains rows, indexed by row_id; the inner contains values within a row, indexed by column_title. The table's methods are very simple:

# return the value at the specified row_id, column_title
get_value(self, row_id, column_title)

# return the inner dictionary representing row given by row_id
get_row(self, row_id) 

# add a column new_column_title, defined by func
# func signature must be: take a row and return a value
add_column(self, new_column_title, func)

Until now, I simply added columns to the original table, and each function took the whole table as an argument. As I'm moving to pure functions, I'll have to make all arguments immutable. So, the initial table becomes immutable. Any additional columns will be created as standalone columns and passed only to those functions that need them. A typical function would take the initial table, and a few columns that are already created, and return a new column.

The problem I run into is how to implement the standalone column (Column)?

I could make each of them a dictionary, but it seems very expensive. Indeed, if I ever need to perform an operation on, say, 10 fields in each logical row, I'll need to do 10 dictionary lookups. And on top of that, each column will contain both the key and the value, doubling its size.

I could make Column a simple list, and store in it a reference to the mapping from row_id to the array index. The benefit is that this mapping could be shared between all columns that correspond to the same initial table, and also once looked up once, it works for all columns. But does this create any other problems?

If I do this, can I go further, and actually store the mapping inside the initial table itself? And can I place references from the Column objects back to the initial table from which they were created? It seems very different from how I imagined a functional approach to work, but I cannot see what problems it would cause, since everything is immutable.

In general does functional approach frown on keeping a reference in the return value to one of the arguments? It doesn't seem like it would break anything (like optimization or lazy evaluation), since the argument was already known anyway. But maybe I'm missing something.

Have you thought about using a Numpy array? This sounds like just the sort of thing Numpy was designed to handle: it's easy to access a particular column or row and pass them around as arguments, and it's very quick, especially for pure-numeric operations. — Benjamin Hodgson, Sep 18 '12 at 21:59
`numpy.array` would work great; `pandas.DataFrame` even better. But unfortunately, I really want to move to pure functions for the reasons I outlined. And this requires immutability, so I cannot just extend the original table by appending newly calculated columns. — max, Sep 18 '12 at 22:21
You can still write functions that accept and return Numpy arrays of whatever dimensions (a single column, a whole table...) without departing the functional style. IMO, the mutability of the `ndarray` type is neither here nor there; you just need to ensure none of your functions have side effects. If you don't want to make use of the mutability, then so be it. — Benjamin Hodgson, Sep 18 '12 at 22:32
Oh then of course. `ndarray` is a reasonable choice for the implementation of `Column` and `Table` classes. — max, Sep 18 '12 at 22:37
See also http://stackoverflow.com/questions/5541324/immutable-numpy-array — Benjamin Hodgson, Sep 18 '12 at 22:37
Why not just keep a list of the row id's and the function for each column? Referential transparency FTW. What's the point of sticking it in a table? — NovaDenizen, Sep 21 '12 at 19:51
@NovaDenizen I would. But the functions I am working with require a whole column. For example, a typical function might be `rank` (take a column of numbers, and return a column that contains the rank for each number in descending order). — max, Sep 22 '12 at 02:17

score 1 · Accepted Answer · answered Sep 25 '12 at 17:19

Here is how I would do it:

Derive your table class from a frozenset.
Each row should be a sublcass of tuple.

Now you can't modify the table -> immutability, great! The next step could be to consider each function a mutation which you apply to the table to produce a new one:

f T -> T'

That should be read as apply the function f on the table T to produce a new table T'. You may also try to objectify the actual processing of the table data and see it as an Action which you apply or add to the table.

add(T, A) -> T'

The great thing here is that add could be subtract instead giving you an easy way to model undo. When you get into this mindset, your code becomes very easy to reason about because you have no state that can screw things up.

Below is an example of how one could implement and process a table structure in a purely functional way in Python. Imho, Python is not the best language to learn about FP in because it makes it to easy to program imperatively. Haskell, F# or Erlang are better choices I think.

class Table(frozenset):
    def __new__(cls, names, rows):
        return frozenset.__new__(cls, rows)

    def __init__(self, names, rows):
        frozenset.__init__(self, rows)
        self.names = names

def add_column(rows, func):
    return [row + (func(row, idx),) for (idx, row) in enumerate(rows)]

def table_process(t, (name, func)):
    return Table(
        t.names + (name,),
        add_column(t, lambda row, idx: func(row))
        )

def table_filter(t, (name, func)):
    names = t.names
    idx = names.index(name)
    return Table(
        names,
        [row for row in t if func(row[idx])]
        )

def table_rank(t, name):
    names = t.names
    idx = names.index(name)
    rows = sorted(t, key = lambda row: row[idx])
    return Table(
        names + ('rank',),
        add_column(rows, lambda row, idx: idx)
        )

def table_print(t):
    format_row = lambda r: ' '.join('%15s' % c for c in r)
    print format_row(t.names)
    print '\n'.join(format_row(row) for row in t)

if __name__ == '__main__':
    from random import randint
    cols = ('c1', 'c2', 'c3')
    T = Table(
        cols,
        [tuple(randint(0, 9) for x in cols) for x in range(10)]
        )
    table_print(T)

    # Columns to add to the table, this is a perfect fit for a
    # reduce. I'd honestly use a boring for loop instead, but reduce
    # is a perfect example for how in FP data and code "becomes one."
    # In fact, this whole program could have been written as just one
    # big reduce.
    actions = [
        ('max', max),
        ('min', min),
        ('sum', sum),
        ('avg', lambda r: sum(r) / float(len(r)))
        ]
    T = reduce(table_process, actions, T)
    table_print(T)

    # Ranking is different because it requires an ordering, which a
    # table does not have.
    T2 = table_rank(T, 'sum')
    table_print(T2)

    # Simple where filter: select * from T2 where c2 < 5.
    T3 = table_filter(T2, ('c2', lambda c: c < 5))
    table_print(T3)

Converting an imperative algorithm that "grows" a table into pure functions

1 Answers1