3

This question is an extension based on here and here.

What is a good approach to mapping a function to a specified key path in nested dicts, including these path specification:

  1. List of keys at a given path position
  2. Key slices (assuming sorting)
  3. Wildcards (ie all keys at a path position)
  4. Handling ragged hierarchies by ignoring keys that don't appear at a given level

If it is makes is simpler, can assume that only dicts are nested, no lists of dicts, since the former can be obtained with dict(enumerate(...)).

However, the hierarchy can be ragged, eg:

data = {0: {'a': 1, 'b': 2},
 1: {'a': 10, 'c': 13},
 2: {'a': 20, 'b': {'d': 100, 'e': 101}, 'c': 23},
 3: {'a': 30, 'b': 31, 'c': {'d': 300}}}

Would like to be able to specify key path like this:

map_at(f, ['*',['b','c'],'d'])

To return:

{0: {'a': 1, 'b': 2},
     1: {'a': 10, 'c': 13},
     2: {'a': 20, 'b': {'d': f(100), 'e': 101}, 'c': 23},
     3: {'a': 30, 'b': 31, 'c': {'d': f(300)}}}

Here f is mapped to key paths [2,b,d] and [3,c,d].

Slicing would be specified as, eg [0:3,b] for instance.

I think the path spec is unambiguous, though could be generalized to, for example, match key path prefix (in which case, f would also be mapped at [0,b]` and other paths).

Can this be implemented via comprehension and recursion or does it require heavy lifting to catch KeyError etc?

Please do not suggest Pandas as an alternative.

alancalvitti
  • 476
  • 3
  • 14
  • 1
    Anything can be implemented via recursion—exactly what kind of “heavy lifting” are you trying to avoid that includes `try`? – Davis Herring Jan 01 '19 at 01:06
  • @DavisHerring, the primary problem is `KeyError` is thrown in ragged data when one or more branches do not have a specified key, as shown in the example. – alancalvitti Jan 02 '19 at 14:34
  • What if a key path resolves to a `dict`? – Davis Herring Jan 02 '19 at 21:28
  • @DavisHerring, if key path resolves to a dict, it should return it. Do you foresee any ambiguities there? – alancalvitti Jan 03 '19 at 14:52
  • No ambiguity, but does “should return it” mean with or without applying `f`? – Davis Herring Jan 03 '19 at 14:58
  • I see your point, I'm trying to implement getter and mapper for such hierarchies. it's possible that `f` intentionally takes a dict as input; in that case the function should be applied (throw an error otherwise). – alancalvitti Jan 03 '19 at 16:50
  • Slice notation can’t appear except within subscripting brackets. You could use the `slice` *function* or write `['*',slicer[0:3]]` with a convenience object `slicer`. – Davis Herring Jan 03 '19 at 17:09
  • I like: "Please do not suggest Pandas as an alternative." – jferard Feb 13 '19 at 16:30

3 Answers3

1

I'm not a big fan of pseudo-code, but in this kind of situation, you need to write down an algorithm. Here's my understanding of your requirements:

map_at(func, path_pattern, data):

  1. if path_pattern is not empty
    • if data is terminal, it's a failure : we did not match the full path_pattern ̀so there is no reason to apply the function. Just return data.
    • else, we have to explore every path in data. We consume the head of path_pattern if possible. That is return a dict data key -> map_at(func, new_path, data value) where new_path is the tail of the path_pattern if the key matches the head, else the `path_pattern itself.
  2. else, it's a success, because all the path_pattern was consumed:
    • if data is terminal, return func(data)
    • else, find the leaves and apply func: return return a dict data key -> map_at(func, [], data value)

Notes:

  • I assume that the pattern *-b-d matches the path 0-a-b-c-d-e;
  • it's an eager algorithm: the head of the path is always consumed when possible;
  • if the path is fully consumed, every terminal should be mapped;
  • it's a simple DFS, thus I guess it's possible to write an iterative version with a stack.

Here's the code:

def map_at(func, path_pattern, data):
    def matches(pattern, value):
        try:
            return pattern == '*' or value == pattern or value in pattern
        except TypeError: # EDIT: avoid "break" in the dict comprehension if pattern is not a list. 
            return False

    if path_pattern:
        head, *tail = path_pattern
        try: # try to consume head for each key of data
            return {k: map_at(func, tail if matches(head, k) else path_pattern, v) for k,v in data.items()}
        except AttributeError: # fail: terminal data but path_pattern was not consumed
            return data
    else: # success: path_pattern is empty.
        try: # not a leaf: map every leaf of every path
            return {k: map_at(func, [], v) for k,v in data.items()}
        except AttributeError: # a leaf: map it
            return func(data)

Note that tail if matches(head, k) else path_pattern means: consume head if possible. To use a range in the pattern, just use range(...).

As you can see, you never escape from case 2. : if the path_pattern is empty, you just have to map all leaves whatever happens. This is clearer in this version:

def map_all_leaves(func, data):
    """Apply func to all leaves"""
    try:
        return {k: map_all_leaves(func, v) for k,v in data.items()}
    except AttributeError:
        return func(data)

def map_at(func, path_pattern, data):
    def matches(pattern, value):
        try:
            return pattern == '*' or value == pattern or value in pattern
        except TypeError: # EDIT: avoid "break" in the dict comprehension if pattern is not a list. 
            return False

    if path_pattern:
        head, *tail = path_pattern
        try: # try to consume head for each key of data
            return {k: map_at(func, tail if matches(head, k) else  path_pattern, v) for k,v in data.items()}
        except AttributeError: # fail: terminal data but path_pattern is not consumed
            return data
    else:
        map_all_leaves(func, data)

EDIT

If you want to handle lists, you can try this:

def map_at(func, path_pattern, data):
    def matches(pattern, value):
        try:
            return pattern == '*' or value == pattern or value in pattern
        except TypeError: # EDIT: avoid "break" in the dict comprehension if pattern is not a list. 
            return False

    def get_items(data):
        try:
            return data.items()
        except AttributeError:
            try:
                return enumerate(data)
            except TypeError:
                raise

    if path_pattern:
        head, *tail = path_pattern
        try: # try to consume head for each key of data
            return {k: map_at(func, tail if matches(head, k) else path_pattern, v) for k,v in get_items(data)}
        except TypeError: # fail: terminal data but path_pattern was not consumed
            return data
    else: # success: path_pattern is empty.
        try: # not a leaf: map every leaf of every path
            return {k: map_at(func, [], v) for k,v in get_items(data)}
        except TypeError: # a leaf: map it
            return func(data)

The idea is simple: enumerate is the equivalent for a list of dict.items:

>>> list(enumerate(['a', 'b']))
[(0, 'a'), (1, 'b')]
>>> list({0:'a', 1:'b'}.items())
[(0, 'a'), (1, 'b')]

Hence, get_items is just a wrapper to return the dict items, the list items (index, value) or raise an error.

The flaw is that lists are converted to dicts in the process:

>>> data2 = [{'a': 1, 'b': 2}, {'a': 10, 'c': 13}, {'a': 20, 'b': {'d': 100, 'e': 101}, 'c': 23}, {'a': 30, 'b': 31, 'c': {'d': 300}}]
>>> map_at(type,['*',['b','c'],'d'],data2)
{0: {'a': 1, 'b': 2}, 1: {'a': 10, 'c': 13}, 2: {'a': 20, 'b': {'d': <class 'int'>, 'e': 101}, 'c': 23}, 3: {'a': 30, 'b': 31, 'c': {'d': <class 'int'>}}}

EDIT

Since you are looking for something like Xpath for JSON, you could try https://pypi.org/project/jsonpath/ or https://pypi.org/project/jsonpath-rw/. (I did not test those libs).

jferard
  • 7,835
  • 2
  • 22
  • 35
  • I expected it work with any nested combinations of `list` and `dict`. Given `data2 = [{'a': 1, 'b': 2}, {'a': 10, 'c': 13}, {'a': 20, 'b': {'d': 100, 'e': 101}, 'c': 23}, {'a': 30, 'b': 31, 'c': {'d': 300}}]`, `map_at(type,['*',['b','c'],'d'],data2)` returns the input, but the top-level wildcard should map over all dicts in the list. Also tried slices like `0:2` and `:` in place of the wildcard, resulting in syntax errors. – alancalvitti Feb 14 '19 at 21:04
  • @alancalvitti I did not implement slices, but you can use `range(0,2)` instead of `0:2`. For the lists, I assumed you had only dicts. If the lists are only at top level, it's easy to fix, but if you have values of dicts that are lists or lists inside lists, it becomes more complex and you will have to check the type of the elements. – jferard Feb 14 '19 at 21:17
  • @alancalvitti Are you trying to implement something like https://en.wikipedia.org/wiki/XPath/https://en.wikipedia.org/wiki/XSLT over JSON data? If this the case, you'd better try an existing library. See my edit for lists. – jferard Feb 15 '19 at 18:23
  • thanks, yes I should really post an expanded Q about general nested combinations of lists, dicts, tuples, and any other container object. And yes I'm aware of `enumerate` but it should only be used internally - the type should remain invariant. I'm not sure XLST will work as it is based on XML, but the general idea is the same, modulo that keys may be, say tuples, so would need to disambiguate. I use Wolfram Language a lot, which has exactly this type of functionality (both query in place and getters) - and in fact dictionary keys can be arbitrary expressions (mutable). – alancalvitti Feb 16 '19 at 19:16
  • @alancalvitti I think you can achieve what you want in Python, but you won't get a full code answer on SO, since that's a complex problem. See my new edit for a library. – jferard Feb 16 '19 at 19:47
  • thanks will check it out. But you can't expect to do in Py what you can do in WL, which is not only functional vs OO, but also b/c of its symbolic processing and pattern matching. Python is like assembly language, relatively speaking. – alancalvitti Feb 17 '19 at 20:13
  • This doesn't work even though `0` is a key: `map_at(type,[0,['b','c'],'d'],data)` but it works if it's in a list `[0,0]`: `map_at_(type,[[0,0],['b','c'],'d'],data)`. Can you explain? – alancalvitti Feb 21 '19 at 17:02
  • in my previous comment, substitute `2` for `0` - the results are the same as I described (I had used a modified `data`). – alancalvitti Feb 21 '19 at 17:16
  • I can explain: it's a bug! See my edit. If you use a list, e.g. `[2]`, the test `value in pattern` returns `True`. If you use an int, `2`, the same test raises a `TypeError` that breaks the dict comprehension if not catched. It's fixed now. – jferard Feb 21 '19 at 20:17
  • @ok now, but still don't like that it changes `list` to `dict`. I'll try to mod by branching on type. I have a more expansive agenda for `query` and `query_at` functions that can also take a list of functions (returning a list of results), and pipelines of functions. Best to ask as separate Q. – alancalvitti Feb 21 '19 at 22:26
  • re [jsonpath-rw](https://pypi.org/project/jsonpath-rw/), from the docs, unfortunately: "array slicing (note that step is unimplemented only due to lack of need thus far)" – alancalvitti Feb 25 '19 at 15:51
  • @alancalvitti "**step** is unimplemented": you can have slices, but the step between elements is always 1 (e.g. no [1,3,5,7] slice). I'm not sure `jsonpath-rw` has all the features you need, though. *Best to ask as separate Q* I'm not sure that's a good idea since you are not looking for the solution of a specific problem, but you need to find (or write) a full library. – jferard Feb 25 '19 at 16:47
  • With `map_at(type,['*','b'], data)` this method will recurse and apply it to deeper levels, eg `'b': {'d': int, 'e': int}` rather than return `b: int` as if path expression was, say `['*','b','*']`. – alancalvitti Feb 25 '19 at 20:27
0

This is not very simple and less efficient, but it should work:

def map_at(f,kp,d): return map_at0(f,kp,d,0)
def slice_contains(s,i):  # no negative-index support
  a=s.start or 0
  return i>=a and (s.end is None or i<s.end) and\
    not (i-a)%(s.step or 1)
def map_at0(f,kp,d,i):
  if i==len(kp): return f(d)
  if not isinstance(d,dict): return d  # no such path here
  ret={}
  p=kp[i]
  if isinstance(p,str) and p!='*': p=p,
  for j,(k,v) in enumerate(sorted(d.items())):
    if p=='*' or (slice_contains(p,j) if isinstance(p,slice) else k in p):
      v=map_at0(f,kp,v,i+1)
    ret[k]=v
  return ret

Note that this copies every dictionary it expands (because it matches the key path, even if no further keys match and f is never applied) but returns unmatched subdictionaries by reference. Note also that '*' can be “quoted” by putting it in a list.

Davis Herring
  • 36,443
  • 4
  • 48
  • 76
0

I think you might appreciate this refreshing generator implementation -

def select(sel = [], d = {}, res = []):

  # (base case: no selector)
  if not sel:                   
    yield (res, d)

  # (inductive: a selector) non-dict
  elif not isinstance(d, dict): 
    return

  # (inductive: a selector, a dict) wildcard selector
  elif sel[0] == '*':           
    for (k, v) in d.items():
      yield from select \
        ( sel[1:]
        , v
        , [*res, k]
        )

  # (inductive: a selector, a dict) list selector
  elif isinstance(sel[0], list):
    for s in sel[0]:
      yield from select \
        ( [s, *sel[1:]]
        , d
        , res
        )

  # (inductive: a selector, a dict) single selector
  elif sel[0] in d:             
    yield from select \
      ( sel[1:]
      , d[sel[0]]
      , [*res, sel[0]]
      )

  # (inductive: single selector not in dict) no match
  else:                         
    return

It works like this -

data = \
  { 0: { 'a': 1, 'b': 2 }
  , 1: { 'a': 10, 'c': 13 }
  , 2: { 'a': 20, 'b': { 'd': 100, 'e': 101 }, 'c': 23 }
  , 3: { 'a': 30, 'b': 31, 'c': { 'd': 300 } }
  }

for (path, v) in select(['*',['b','c'],'d'], data):
  print(path, v)

# [2, 'b', 'd'] 100
# [3, 'c', 'd'] 300

Because select returns an iterable, you can use conventional map function on it -

s = select(['*',['b','c'],'d'], data)

work = lambda r: f"path: {r[0]}, value: {r[1]}"

for x in map(work, s):
  print(x)

# path: [2, 'b', 'd'], value: 100
# path: [3, 'c', 'd'], value: 300
Mulan
  • 129,518
  • 31
  • 228
  • 259
  • thanks, it's an interesting approach to use generators. However, in this particular Q I'm looking for a `map_at` which should return the entire input with the function mapped at the leaf nodes of the path expressions. You extracts these instead and incorrectly. Mapping eg `type` using your method: `list(map(type,select(['*',['b','c'],'d'], data)))` incorrectly returns `[tuple, tuple]`. It should return `[int,int]`. Similarly, `lambda x:x+1` throws a `TypeError` when it should return `[101,301]` – alancalvitti Feb 22 '19 at 16:00