3

What is the best solution to pivot/cross-tab tables in Python 3? Is there a built-in function that will do this? Ideally, I'm looking for a Python 3 solution that does not have external dependencies. For example, given a nested list:

nl = [["apples", 2 "New York"], 
      ["peaches", 6, "New York"],
      ["apples", 6, "New York"],
      ["peaches", 1, "Vermont"]]  

I would like to be able to rearrange rowed data and groupby fields:

             apples    peaches
New York        2         6
Vermont         6         1

The above is a trivial example, but is there a solution that would be easier than using itertools.groupby everytime a pivot is desired? Ideally, the solution would allow rowed data to be pivoted on any column. I was debating about using pandas, but it is an external library and only has limited Python 3 support.

drbunsen
  • 10,139
  • 21
  • 66
  • 94
  • "easier than using itertools.groupby"? What's wrong with that? Can you be specific on what you don't like about it? "but it is an external "? How is that bad? Can you be more clear on why the solutions are unacceptable? – S.Lott Jan 20 '12 at 21:53
  • If I have a dictionary with a nested list of values, I dont feel it's entirely trivial to pull out a specific value and pivot on this. – drbunsen Jan 20 '12 at 22:24
  • By "easier" I was thinking of the many statistical languages like SAS or R which which have built-in functions that simplify this process; for example, R's plyr and reshape packages. – drbunsen Jan 21 '12 at 13:21

2 Answers2

1

Here is some simple code. Providing row/column/grand totals is left as an exercise for the reader.

class CrossTab(object):

    def __init__(
        self,
        missing=0, # what to return for an empty cell.
                   # Alternatives: '', 0.0, None, 'NULL'
        ):
        self.missing = missing
        self.col_key_set = set()
        self.cell_dict = {}
        self.headings_OK = False

    def add_item(self, row_key, col_key, value):
        self.col_key_set.add(col_key)
        try:
            self.cell_dict[row_key][col_key] += value
        except KeyError:
            try:
                self.cell_dict[row_key][col_key] = value
            except KeyError:
                self.cell_dict[row_key] = {col_key: value}

    def _process_headings(self):
        if self.headings_OK:
            return
        self.row_headings = list(sorted(self.cell_dict.keys()))
        self.col_headings = list(sorted(self.col_key_set))
        self.headings_OK = True

    def get_col_headings(self):
        self._process_headings()
        return self.col_headings

    def generate_row_info(self):
        self._process_headings()
        for row_key in self.row_headings:
            row_dict = self.cell_dict[row_key]
            row_vals = [
                row_dict.get(col_key, self.missing)
                for col_key in self.col_headings
                ]
            yield row_key, row_vals

if __name__ == "__main__":

    data = [["apples", 2, "New York"], 
      ["peaches", 6, "New York"],
      ["apples", 6, "New York"],
      ["peaches", 1, "Vermont"]]  

    ctab = CrossTab(missing='uh-oh')
    for s in data:
        ctab.add_item(row_key=s[2], col_key=s[0], value=s[1])
    print()
    print('Column headings:', ctab.get_col_headings())
    for row_heading, row_values in ctab.generate_row_info():
        print(repr(row_heading), row_values)

Output:

Column headings: ['apples', 'peaches']
'New York' [8, 6]
'Vermont' ['uh-oh', 1]

See also this answer.

And this one, which I'd forgotten about.

Community
  • 1
  • 1
John Machin
  • 81,303
  • 11
  • 141
  • 189
0

itertools.groupby was exactly made for this problem. You will be hard-pressed to find something better, especially within the standard library.

bukzor
  • 37,539
  • 11
  • 77
  • 111