Matching JSON keys with regex in Python

Question

I'm trying to find a regular expression which matches repeated keys on different levels of a nested JSON string representation. All my "solutions" suffer from catastrophic backtracking so far.

An example of that JSON string looks like this:

d = {               
        "a": {      
            "b": {
                "c": {
                    "d": "v1", 
                    "key": "v2"
                }
            },
            "c": {  
                "g": "v3",     
                "key": "v4"
            },
            "key": "v5"        
        }
    }

The value of key is the target. My application does have all object names leading to that key. With these names I can use a for loop to construct my final regex. So basically I need the parts to put in between.

Example: If I get "a" and "key" I could construct the following: "a"[^}]*"key". This matches the first "key" in my string d, the one with value v2.

What should happen, though, is that "a" + "key" matches the key with value v5. The key with value v2 should be matched when the full path "a" + "b" + "c" + "key" comes in. The last case in this example would be matching the key with value v4 when "a" + "c" + "key" is given.

So a complete regex for the last one would look similar to this:

"a"MATCH_EVERYTHING_IN_BETWEEN_REGEX"c"MATCH_EVERYTHING_IN_BETWEEN_REGEX"key":\s*(\[[^}]*?\]|".*?"|\d+\.*\d*)

To be clear, I am looking for this MATCH_EVERYTHING_IN_BETWEEN_REGEX expression which I can plug in as connectors. This is to make sure it matches only the key I have received the path for. The JSON string could be infinitely nested.

Here is an online regex tester with the example: https://regex101.com/r/yNZ3wo/2

Note: I know this is not Python-specific, but I'm also grateful for Python hints in this context. I thought about building my own parser, using a stack and counting { and }, but first I would like to make sure there is no easy regex solution.

EDIT: I know about the json library but this doesn't solve my case since I'm tracking the coordinates of my targets within the string representation inside an editor window. I'm not looking for the values themselves, I can access them from an associated dictionary.

Are you trying to get the value of `key` or build a regex? If the former, why not use the `json` module's functions? (e.g. `json.loads(string_json)`) — , May 26 '18 at 05:03
The reason I need to work with a string representation is that I'm tracking the positions of these values in an editor window. So, it's not about the value, more about the string itself. — loxosceles, May 26 '18 at 05:12
Oooh. That sounds fun! I don't know what to say but [maybe this will help](https://stackoverflow.com/questions/26385984/recursive-pattern-in-regex). Best of luck; I'll be coming back here looking for your updates! — , May 26 '18 at 05:40

score 1 · Accepted Answer · answered May 26 '18 at 07:15

This is hard. A possible solution is to use

1. a recursive regex* to match nested braces:
   (?<="a": )({(?>[^{}]|(?1))*})
2. and then continue the search for the key on the inner level using a trash-can approach, i.e. ignore the overall match and just look at a specific capturing group if it contains a value (here $2, add groups as needed):
   ({(?>[^{}]|(?1))*})|"key":\s*"([^"]*?)"

Code sample:

import regex as re

test_str = ("{                   \n"
    "  \"a\": {            \n"
    "    \"b\": {          \n"
    "      \"c\": {        \n"
    "        \"d\": \"v1\",  \n"
    "        \"key\": \"v2\" \n"
    "      }             \n"
    "    },              \n"
    "    \"c\": {          \n"
    "      \"g\": \"v3\",    \n"
    "      \"key\": \"v4\"   \n"
    "    },              \n"
    "    \"key\": \"v5\"  \n"
    "  }                 \n"
    "}                   \n")

regex = r"(?<=\"a\": )({(?>[^{}]|(?1))*})"
innerRegex = r"({(?>[^{}]|(?1))*})|\"key\":\s*\"([^\"]*?)\""

matches = re.finditer(regex, test_str, re.DOTALL)

for match in matches:
    # Strip the outer braces so the inner regex only sees the contents of "a".
    inner = match.group()[1:-1]

    innerMatches = re.finditer(innerRegex, inner, re.DOTALL)
    for innerMatch in innerMatches:
        # Group 1 swallows nested objects (the "trash can"); only group 2 holds a value on this level.
        if innerMatch.group(2) is not None:
            # Note: these offsets are relative to `inner`, not to test_str.
            print("Found at {start}-{end}: {group}".format(start=innerMatch.start(2), end=innerMatch.end(2), group=innerMatch.group(2)))

or continue the search on the next level (not shown in the code above).
Basically, you would continue from the inner match again from step 1 in the same way (see demo), e.g.:

(?<="c": )({(?>[^{}]|(?1))*})

This should give you a head start.

* Since we use regex recursion, we need the alternative Python regex package.
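
To parameterize the two patterns above by key name, a minimal sketch could look like the following. The helper names `object_pattern` and `value_pattern` and the constant `NESTED` are made up for illustration and are not part of any library. The block only builds the pattern strings; it assumes they are later handed to the third-party regex package, because the standard re module does not support the recursive (?1) construct.

```python
import re  # used only for re.escape; the resulting patterns need the third-party `regex` package

# Balanced-braces sub-pattern from the answer: group 1 recurses into itself via (?1).
NESTED = r"({(?>[^{}]|(?1))*})"


def object_pattern(name):
    """Pattern for the whole {...} object that follows `"name": ` (hypothetical helper)."""
    return r'(?<="' + re.escape(name) + r'": )' + NESTED


def value_pattern(name):
    """Trash-can pattern: group 1 swallows nested objects, group 2 is the string value of `name`."""
    return NESTED + r'|"' + re.escape(name) + r'":\s*"([^"]*?)"'


if __name__ == "__main__":
    print(object_pattern("a"))
    print(value_pattern("key"))
```

For a path such as "a", "c", "key", one would presumably apply object_pattern to every name except the last, each time searching only inside the previous match, and value_pattern to the last name.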

Obviously, you need to refactor and parameterize this approach according to your needs. — wp78de, May 26 '18 at 19:52
I have not built the function which matches a full path and only that, but from the tests I've made I can see that your approach would work. Someone earlier mentioned that regex is probably not the right tool for this though, and I think that is probably correct. I figured out an alternative solution to regex which I will upload later. This is still the best answer to my question, however. — loxosceles, May 27 '18 at 20:58

loxosceles · Answer 2 · 2018-05-29T04:22:36.897

Thanks to the answer provided by wp78de I realized that regex in this case is not the right tool for the job, at least not the one I wanted. Maybe this is of use for someone else, that's why I'm adding this here.

So, I wrote a function which solves the problem recursively.

I made use of the fact that I know which key has to be matched at which level, so it only increments the key index (ind) when this is the case. Other keys which are not matched by name and level together trigger an exception. The if clauses at the end take care of the nesting level.

As a first step I convert the string into a list of lines (with preceding blanks stripped):

import json
import re

d = \
{
    "a": {
        "b": {
            "c": {
                "d": "v1",
                "key": "v2" # line 6
                }
            },
        "x": {
            "c": {
                "d": "v11",
                "key": "v20" # line 12
                }                      
            },                         
        "c": {                         
            "g": "v3",                 
            "key": "v4" # line 17      
            },                         
        "key": "v5" # line 19          
    }                                  
}       

ds = json.dumps(d, indent=4)     

l = ds.split('\n')               
ll = [x.lstrip() for x in l]     


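def findkey(l, t, lev=0, ind=0):
    """Return the number of lines consumed up to and including the line
    that completes the key path t, plus one.

    l: list of stripped lines, t: tuple of key names (quotes included),
    lev: current nesting level, ind: index of the next key to match.
    Raises IndexError if l runs out before the whole path is matched.
    """
    if ind == len(t):
        return 1
    else:
        el = l[0]
        try:
            # Advance only when both the name and the nesting level fit.
            # t.index(...) equals ind as long as the names in t are distinct.
            if el.startswith(t[ind]) and t.index(t[ind]) == lev:
                ind += 1
        except IndexError as e:
            pass

        # Track the nesting level for the following lines.
        if "{" in el:
            lev += 1
        if "}" in el:
            lev -= 1
        return 1 + findkey(l[1:], t, lev, ind)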
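
Since the recursive version makes one call and one list slice per line, a long document could run into Python's recursion limit. A loop-based sketch of the same line-counting idea might look like this. The name `findkey_iter` is hypothetical, and the sketch deliberately differs in two ways: it takes the full list of stripped lines and returns a 0-based index (or None when the path is not found), and it forgets already matched keys once their object closes. It assumes the json.dumps(indent=4) layout used here, with one key per line and no braces inside string values.

```python
import json


def findkey_iter(lines, path):
    """Return the 0-based index of the line that completes `path`, or None."""
    level = -1    # the opening "{" line of the document brings this to 0
    matched = 0   # number of leading path entries matched so far
    for idx, line in enumerate(lines):
        # A key counts only if its name and its nesting level both fit.
        if matched == level and line.startswith(path[matched]):
            matched += 1
            if matched == len(path):
                return idx
        if "{" in line:
            level += 1
        if "}" in line:
            level -= 1
            # An object closed: keys matched deeper than this level are stale.
            matched = max(0, min(matched, level))
    return None


d = {"a": {"b": {"c": {"d": "v1", "key": "v2"}},
           "x": {"c": {"d": "v11", "key": "v20"}},
           "c": {"g": "v3", "key": "v4"},
           "key": "v5"}}
lines = [x.lstrip() for x in json.dumps(d, indent=4).split("\n")]

if __name__ == "__main__":
    print(findkey_iter(lines, ('"a"', '"c"', '"key"')))  # index of the "key": "v4" line
```

As with the recursive function, the result is only a line index; the column offsets still have to be found on that line.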

The above only returns the line number but now I can match my target with a very simple regex:

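tup = ('"a"', '"c"', '"key"')  # assumed example path (a -> c -> key); names include their quotes because findkey uses startswith
idx = findkey(ll[1:], tup) - 1
s = re.compile(tup[-1] + r': (\s*(\[[^}]*?\]|".*?"|\d+\.*\d*))', re.DOTALL)
match = s.search(l[idx])
print("Value found at start index: {}, stop index: {}".format(match.start(1), match.end(2)))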

Output:

Value found at start index: 19, stop index: 23
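
The stack-and-brace-counting idea from the question can also be applied per character instead of per line, which yields offsets into the whole string directly. The following is only a sketch under stated assumptions: `find_value_span` is a made-up name, the text is assumed to be valid JSON whose top-level value is an object, and key names are assumed to contain no escape sequences. The end of the value is found with json.JSONDecoder.raw_decode, which returns the parsed value together with the index where it ended.

```python
import json


def find_value_span(text, path):
    """Return (start, end) offsets of the value stored under `path`, or None."""
    target = tuple(path)
    decoder = json.JSONDecoder()
    stack = []       # key under which each currently open object was found
    pending = None   # most recent key that is still waiting for its value
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == '"':
            # Read the whole string token so braces inside strings are ignored.
            j = i + 1
            while text[j] != '"':
                j += 2 if text[j] == "\\" else 1
            token = text[i + 1:j]
            i = j + 1
            # Skip whitespace; a following colon means the token was a key.
            k = i
            while k < len(text) and text[k].isspace():
                k += 1
            if k < len(text) and text[k] == ":":
                k += 1
                while k < len(text) and text[k].isspace():
                    k += 1
                # stack[0] belongs to the root object and carries no key.
                if tuple(stack[1:]) + (token,) == target:
                    _, length = decoder.raw_decode(text[k:])
                    return k, k + length
                pending = token
                i = k
            continue
        if ch == "{":
            stack.append(pending)
        elif ch == "}":
            stack.pop()
        if ch in "{}[,":
            pending = None  # the pending key has received its value
        i += 1
    return None


doc = ('{"a": {"b": {"c": {"d": "v1", "key": "v2"}}, '
       '"c": {"g": "v3", "key": "v4"}, "key": "v5"}}')

if __name__ == "__main__":
    start, end = find_value_span(doc, ("a", "c", "key"))
    print(start, end, doc[start:end])
```

Here the path entries are plain names without quotes, and the returned offsets refer to the complete string rather than to a single line.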

Here is a pyfiddle:
