rebuilding arrays with nested defaultdict

Question

This question is an extension of a previous question: rebuild python array based on common elements - but different enough to warrant a new question:

I've been struggling with this for a bit now. My data is an array of dictionaries from an sql query. Each element in the array represents a shipment, and there are common values based on the keys.

data = [
    {"CustName":"customer1", "PartNum":"part1", "delKey":"0001", "qty":"10", "memo":"blah1"},
    {"CustName":"customer1", "PartNum":"part1", "delKey":"0002", "qty":"10", "memo":"blah2"},
    {"CustName":"customer1", "PartNum":"part1", "delKey":"0003", "qty":"10", "memo":"blah3"},
    {"CustName":"customer2", "PartNum":"part3", "delKey":"0004", "qty":"20", "memo":"blah4"},
    {"CustName":"customer2", "PartNum":"part3", "delKey":"0005", "qty":"20", "memo":"blah5"},
    {"CustName":"customer3", "PartNum":"partXYZ", "delKey":"0006", "qty":"50", "memo":"blah6"},
    {"CustName":"customer3", "PartNum":"partABC", "delKey":"0007", "qty":"100", "memo":"blah7"}]

The output I want is grouped according to specific keys

dataOut = [
   {"CustName":"customer1", "Parts":[
        {"PartNum":"part1", "deliveries":[
            {"delKey":"0001", "qty":"10", "memo":"blah1"},
            {"delKey":"0002", "qty":"10", "memo":"blah2"},
            {"delKey":"0003", "qty":"10", "memo":"blah3"}]}]},
   {"CustName":"customer2", "Parts":[
        {"PartNum":"part3", "deliveries":[
            {"delKey":"0004", "qty":"20", "memo":"blah4"},
            {"delKey":"0005", "qty":"20", "memo":"blah5"}]}]},
   {"CustName":"customer3", "Parts":[
        {"PartNum":"partXYZ", "deliveries":[
            {"delKey":"0006", "qty":"50", "memo":"blah6"}]},
        {"PartNum":"partABC", "deliveries":[
            {"delKey":"0007", "qty":"100", "memo":"blah7"}]}]}]

I can get the grouping with a single level using defaultdict and list comprehension as provided by the previous question and modified slightly

d = defaultdict(list)
for item in data:
    d[item['CustName']].append(item)
print([{'CustName': key, 'parts': value} for key, value in d.items()])

But I can't seem to get the second level in the output array - the grouping b the PartNum key. Through some research, I think what I need to do is use defaultdict as the type of the outer `defaultdict' like so:

d = defaultdict(defaultdict(list))

which throws errors because defaultdict returns a function, so I need to use lambda (yes?)

d = defaultdict(lambda:defaultdict(list))
for item in data:
    d[item['CustName']].append(item) <----this?

My question is how to "access" the second level array in the loop and tell the "inner" defaultdict what to group on (PartNum)? The data comes to me from the database programmer and the project keeps evolving to add more and more data (keys), so I'd like this solution to be as general as possible in case more data gets thrown my way. I was hoping to be able to "chain" the defaultdicts depending on how many levels I need to go. I'm learning as I'm going, so I'm struggling trying to understand the lambda and the basics of the defaultdict type and where to go from here.

Sort the dict then apply [groupby](https://docs.python.org/3/library/itertools.html#itertools.groupby). Or do it beforehand with SQL if you can. I cannot help you more now, I am on a mobile phone... — Pynchia, Jan 04 '16 at 01:31
Can there be two ```delKey```'s with the same number/value for a ```PartNum```? — wwii, Jan 04 '16 at 02:27
Do you care about the order of the values in the lists in your output? If not, you could easily get rid of those levels and make your structure just a nested set of dictionaries. `Tree = lambda: defaultdict(Tree)` is all the setup you need for that kind of structure. — Blckknght, Jan 04 '16 at 05:15

cr3 · Accepted Answer · 2016-01-04T03:32:55.390

Using groupby as suggested by @Pynchia and using sorted for unordered data as suggested by @hege_hegedus:

from itertools import groupby
dataOut = []
dataSorted = sorted(data, key=lambda x: (x["CustName"], x["PartNum"]))
for cust_name, cust_group in groupby(dataSorted, lambda x: x["CustName"]):
    dataOut.append({
        "CustName": cust_name,
        "Parts": [],
    })
    for part_num, part_group in groupby(cust_group, lambda x: x["PartNum"]):
        dataOut[-1]["Parts"].append({
            "PartNum": part_num,
            "deliveries": [{
                "delKey": delivery["delKey"],
                "memo": delivery["memo"],
                "qty": delivery["qty"],
            } for delivery in part_group]
        })

If you look at the second for loop, this will hopefully answer your question about accessing the second level array in the loop.

This method seemed to work the best for me to get the desired output. I wanted to use the `tree` method, but. I was not able to get lists inside the tree. — guidoc, Jan 05 '16 at 03:36

score 2 · Answer 2 · edited May 23 '17 at 12:04

You could use a tree-like data structure based on an OrderedDefaultdict instead of a defaultdict(list). (The definition's from an unrelated answer of mine.)

from collections import OrderedDict

class OrderedDefaultdict(OrderedDict):
    def __init__(self, *args, **kwargs):
        if not args:
            self.default_factory = None
        else:
            if not (args[0] is None or callable(args[0])):
                raise TypeError('first argument must be callable or None')
            self.default_factory = args[0]
            args = args[1:]
        super(OrderedDefaultdict, self).__init__(*args, **kwargs)

    def __missing__ (self, key):
        if self.default_factory is None:
            raise KeyError(key)
        self[key] = default = self.default_factory()
        return default

Tree = lambda: OrderedDefaultdict(Tree)

d = Tree()
for rec in data:
    custName, partNum, delKey = rec['CustName'], rec['PartNum'], rec['delKey']
    details = {"qty": rec["qty"], "memo": rec["memo"]}
    d[custName]['Parts'][partNum]['deliveries'][delKey] = details

So, for the data shown in your question, d would end up containing:

d = {
    "customer1": {
        "Parts": {
            "part1": {
                "deliveries": {"0001": {"memo": "blah1", "qty": "10"},
                               "0002": {"memo": "blah2", "qty": "10"},
                               "0003": {"memo": "blah3", "qty": "10"}}}}},
    "customer2": {
        "Parts": {
            "part3": {
                "deliveries": {"0004": {"memo": "blah4", "qty": "20"},
                               "0005": {"memo": "blah5", "qty": "20"}}}}},
    "customer3": {
        "Parts": {
            "partXYZ": {
                "deliveries": {"0006": {"memo": "blah6", "qty": "50"}}},
            "partABC": {
                "deliveries": {"0007": {"memo": "blah7", "qty": "100"}}}}}
}

Which could just simply be printed out since it's now grouped the way you want.

I was wondering if a custom object would be the right approach - make it easier to accomplish. thnx — wwii, Jan 04 '16 at 03:21
Note that, as @Blckknght [said](http://stackoverflow.com/questions/34583525/rebuilding-arrays-with-nested-defaultdict#comment56915353_34583525), if you don't need to preserve the order of the items in `data`, you wouldn't need to define an `OrderedDefaultdict` and could just use `Tree = lambda: defaultdict(Tree)` instead. Using either dictionary-based data-structure would likely be faster than using a list-based one if there are many items to process. — martineau, Jan 04 '16 at 11:26
This is very close to what I need, however, having a `dict` at each level makes further iterations problematic. For me, logically it should be `Parts:[...]`, not `Parts:{...}` since each customer has a grouping of parts. This whole mess of data is getting passed along to an angular front end - which expects an array. — guidoc, Jan 04 '16 at 15:28
It would be kludgy, but you could probably make the `__missing__ ()` method check the value of the `key` and return an empty `list` whenever it's `'Parts'` instead of calling `self.default_factory()`. Implementing it better would require at least abstracting what the special key was rather than hardcoding it. Essentially, you'd be defining which key indicates that a tree "leaf" was wanted rather than a "branch" or node. — martineau, Jan 04 '16 at 16:34

Tamas Hegedus · Answer 3 · 2016-01-05T10:49:51.990

This is the prettiest way I could do it. It uses the same defaultdict idea to implement proper grouping, as python's builtin groupby function only works on ordered data.

Note that this version will mutate the items in the input dataset, so the leaf items in the result are the same dict instances as the input, but with "CustName" and "PartNum" entries deleted.

from collections import defaultdict

def groupby_mutate(seq, key):
  d = defaultdict(list)
  for item in seq:
    d[item[key]].append(item)
    del item[key]
  return d

def your_operation(data):
  return [ {
    'CustName': CustName,
    'Parts': [ { 
      'PartNum': PartNum,
      'deliveries': deliveries
    } for PartNum,deliveries in groupby_mutate(custItems, 'PartNum').items() ]
  } for CustName,custItems in groupby_mutate(data, 'CustName').items() ]


# try it
from pprint import *
data = [
    {"CustName":"customer1", "PartNum":"part1", "delKey":"0001", "qty":"10", "memo":"blah1"},
    {"CustName":"customer1", "PartNum":"part1", "delKey":"0002", "qty":"10", "memo":"blah2"},
    {"CustName":"customer1", "PartNum":"part1", "delKey":"0003", "qty":"10", "memo":"blah3"},
    {"CustName":"customer2", "PartNum":"part3", "delKey":"0004", "qty":"20", "memo":"blah4"},
    {"CustName":"customer2", "PartNum":"part3", "delKey":"0005", "qty":"20", "memo":"blah5"},
    {"CustName":"customer3", "PartNum":"partXYZ", "delKey":"0006", "qty":"50", "memo":"blah6"},
    {"CustName":"customer3", "PartNum":"partABC", "delKey":"0007", "qty":"100", "memo":"blah7"}
]

pprint(your_operation(data))

EDIT:

Just in the case somebody needs it in the future, here is a version that does not mutate the original data:

from collections import defaultdict

def groupby_getitem(seq, key):
  d = defaultdict(list)
  for item in seq:
    d[item[key]].append(item)
  return d

def your_operation(data):
  return [ {
    'CustName': CustName,
    'Parts': [ { 
      'PartNum': PartNum,
      'deliveries': [ dict(
        (k,v) for k,v in delivery.items() if not k in ['CustName', 'PartNum']
      ) for delivery in deliveries ]
    } for PartNum,deliveries in groupby_getitem(custItems, 'PartNum').items() ]
  } for CustName,custItems in groupby_getitem(data, 'CustName').items() ]

Could there be consequences to mutating the original data? Is the function ```groupby_mutate``` accomplishing its task through a side affect? — wwii, Jan 04 '16 at 03:19
Yes it is. The `groupby_mutate` function is not really reusable, it is designed specially to fit in this situation, I hardly doubt it could be used elsewhere. — Tamas Hegedus, Jan 04 '16 at 03:21
@wwii: If it has consequences depends on the context. For data directly coming from a service it should be fine, if the original data instance is not used by another component of the software — Tamas Hegedus, Jan 04 '16 at 03:22
I did not want to mutate the data. I would have to add some tests to verify the data hasn't been corrupted. — guidoc, Jan 05 '16 at 03:38

wwii · Answer 4 · 2016-01-04T04:46:32.827

Sort by "CustName", "PartNum", "delKey". Iterate over the delivery items for each part, for each customer and accumulate to match your output spec.

I like to use operator.itemgetter - for me it makes things clearer.

import collections, itertools, operator

cust_name = operator.itemgetter('CustName')
part_num = operator.itemgetter('PartNum')
group_sort = operator.itemgetter('CustName', 'PartNum', 'delKey')
del_key = operator.itemgetter('delKey')
qty = operator.itemgetter('qty')
memo = operator.itemgetter('memo')


# sort on the relavent keys
data.sort(key = group_sort)
result = []

# iterate over customers
for custname, group1 in itertools.groupby(data, cust_name):
    cust_dict = {'CustName' : custname, 'Parts': []}
    # iterate over parts for this customer
    for partnum, group2 in itertools.groupby(group1, part_num):
        part_dict = {"PartNum" : partnum, 'deliveries' : []}
        # iterate over delivery items for this part
        for thing in group2:
            part_dict['deliveries'].append({'delKey':del_key(thing),
                                            'qty':qty(thing),
                                            'memo':memo(thing)})
        cust_dict['Parts'].append(part_dict)
    result.append(cust_dict)

This clearly iterates over the items in the original data multiple times which may be a performance hit -- but I don't see a way around multiple iteration for what you need to do.

rebuilding arrays with nested defaultdict

4 Answers4