Python RegEx finding all matches between certain strings

Question

This is my first question!

I have a string that looks like this (excerpt):

"""
"random": [
    {
        "d3rdda": "dasd212dsadsa",
        "author": {
            "email": "john@doe.com",
            "name": "John Doe"
        },
"""

I need to find all authors with the corresponding email addresses, so I want to get every match that starts with

"author": {

and ends with

}

because that would exactly give me

["email": "john@doe.com", "name": "John Doe"]

Sadly, this is not working:

result = re.findall(r'\"author\": {.+}$', ext)
print(result)

I'm very grateful for any support!

That is JSON, use `import json`, parse the JSON and find the `author` keys and grab their values. — Wiktor Stribiżew, Feb 21 '22 at 22:13
The problem is author is not always nested in the same level — Dr.Zipfeltitte, Feb 21 '22 at 22:16
It does not matter, JSON is JSON and needs to be parsed with a dedicated library. — Wiktor Stribiżew, Feb 21 '22 at 22:18
You should also provide a [mre] that is representative of your issue. It's not difficult to parse a json object recursively for all properties named `"author"`. — Pranav Hosangadi, Feb 21 '22 at 22:19
I agree json should not be parsed with regex, but just to say why your regex is not working: the "." does not match newlines, you have to put them explicitly in the capture group using \n token (and then also maybe \r). — niid, Feb 21 '22 at 22:21
An example JSON/string would be something like this: https://api.github.com/users/rotki/events/public — Dr.Zipfeltitte, Feb 21 '22 at 22:28
@niid Correct. Another approach would be to use multiline mode: https://stackoverflow.com/questions/587345/regular-expression-matching-a-multiline-block-of-text — Nick ODell, Feb 21 '22 at 22:29
@NickODell I was testing it using multiline mode, but the problem is `.` does not match control characters. — niid, Feb 21 '22 at 22:44
I have never heard about multiline mode before, I have to dive into this. The tricky part is also the quotation marks in the string/JSON. I am able to find the first match with a lambda function by finding the first index of the string with the find function and then slicing it. But I think regex would be more appropriate here, and would also give me all the matches, not just one. I still appreciate any help. — Dr.Zipfeltitte, Feb 21 '22 at 22:46
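
To make the point from the comments concrete, here is a minimal sketch of why the pattern in the question returns nothing. The variable `ext` is filled with the excerpt from the question; the names `original` and `partial` are only illustrative. Without extra flags, `.` does not match a newline, so `.+` cannot get past the line break that directly follows the opening brace.

```python
import re

# The excerpt from the question, used here as stand-in data
ext = '''"random": [
    {
        "d3rdda": "dasd212dsadsa",
        "author": {
            "email": "john@doe.com",
            "name": "John Doe"
        },
'''

# Pattern from the question: ".+" needs at least one non-newline character,
# but the character right after the opening brace is a newline
original = re.findall(r'\"author\": {.+}$', ext)
print(original)

# Without flags, ".*" stops at the end of the line it started on
partial = re.search(r'"author": {.*', ext).group(0)
print(repr(partial))
```

The first call finds no match at all, and the second shows that the match cannot extend beyond the line containing `"author": {`.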

score 0 · Answer 1 · answered Feb 21 '22 at 23:20

This seems to work for this example, but every other "author" block would have to have the same format (the same number of lines and the same whitespace layout).

re.findall(r'\"author\".*\S+\s+.*\S+\s+.*\S+\s+}', ext)

Kirsten_J


Thank you! This works with the cutout of my API response, but not with the full response. – Dr.Zipfeltitte Feb 22 '22 at 09:02

score 0 · Accepted Answer · answered Feb 21 '22 at 23:31

This isn't a good application for regex. Instead, you should deserialize it using the json library, and find any dict keys named "author" in the resulting object. This is easy to do using a recursive function:

def find_authors(obj):
    authors = [] # Empty list
    if isinstance(obj, dict): # If obj is a dict, iterate over its keys
        for key in obj:
            if key == "author": # If the key is author, then append it to our return list
                authors.append(obj[key])

            elif isinstance(obj[key], (list, dict)): 
                # Else, if the value is a list or a dict, then look for authors inside it 
                # and extend the original list with the result
                authors.extend(find_authors(obj[key]))

    elif isinstance(obj, list): # Else if it's a list, iterate over its elements
        for elem in obj:
            # Look for authors in each element of the list, and extend the main authors list
            authors.extend(find_authors(elem)) 

    return authors

import urllib.request
import json

r = urllib.request.urlopen("https://api.github.com/users/rotki/events/public")
txt = r.read()
jobj = json.loads(txt)

find_authors(jobj)

This gives a list containing all "author" entries in the JSON. Note that this is an actual Python list of dictionaries, not a JSON string.

[{'email': 'lefteris@refu.co', 'name': 'Lefteris Karapetsas'},
 {'email': 'lefteris@refu.co', 'name': 'Lefteris Karapetsas'},
 {'email': 'lefteris@refu.co', 'name': 'Lefteris Karapetsas'},
 {'email': 'lefteris@refu.co', 'name': 'Lefteris Karapetsas'},
 {'email': 'lefteris@refu.co', 'name': 'Lefteris Karapetsas'},
 {'email': 'lefteris@refu.co', 'name': 'Lefteris Karapetsas'},
 {'email': 'lefteris@refu.co', 'name': 'Lefteris Karapetsas'},
 {'email': 'lefteris@refu.co', 'name': 'Lefteris Karapetsas'},
 {'email': 'lefteris@refu.co', 'name': 'Lefteris Karapetsas'}]
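
Because the live API response changes over time, an offline check may be useful. The following is a small sketch that runs the same recursive idea on an inline JSON string; the sample data is made up for illustration, and it places "author" at two different nesting depths to mirror the concern raised in the question comments.

```python
import json

def find_authors(obj):
    """Collect the value of every "author" key at any nesting depth."""
    authors = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == "author":
                authors.append(value)
            elif isinstance(value, (list, dict)):
                # Recurse into nested containers
                authors.extend(find_authors(value))
    elif isinstance(obj, list):
        for elem in obj:
            authors.extend(find_authors(elem))
    return authors

# Made-up sample data: "author" appears at two different depths
sample = '''
[
  {"type": "PushEvent",
   "payload": {"commits": [
       {"sha": "abc123",
        "author": {"email": "john@doe.com", "name": "John Doe"}}
   ]}},
  {"type": "Other",
   "author": {"email": "jane@doe.com", "name": "Jane Doe"}}
]
'''

authors = find_authors(json.loads(sample))
print(authors)
```

Both entries are found regardless of how deeply they are nested, which is the main advantage over a pattern that depends on the text layout.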

Hi Pranav, thank you very much for your work, I really appreciate it! I like this non-regex approach, but sadly your code returns an empty list for me. Any idea why? — Dr.Zipfeltitte, Feb 22 '22 at 09:24
@Dr.Zipfeltitte I get `[{'email': 'yabirg@protonmail.com', 'name': 'Yabir Benchakhtir'}, {'email': 'lefteris@refu.co', 'name': 'Lefteris Karapetsas'}]` with the code. — Wiktor Stribiżew, Feb 22 '22 at 10:57
Thanks! It works now, I guess my firewall was blocking the response. — Dr.Zipfeltitte, Feb 22 '22 at 12:45

score 0 · Answer 3 · answered Feb 22 '22 at 00:57

You may try this.

re.findall('"author": {.+}', ext, re.DOTALL)
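
As an aside, one possible reason this behaves differently on a longer input is greediness: with `re.DOTALL`, a greedy `.+` runs to the last `}` in the whole string. The sketch below compares it with a non-greedy `.+?` on made-up sample data; the names `greedy` and `lazy` are only illustrative, and the non-greedy variant is an assumption about what might help, not something the answer itself proposes.

```python
import re

# Made-up sample data with two "author" blocks
ext = '''{
  "a": {"author": {
            "email": "john@doe.com",
            "name": "John Doe"
        }, "x": 1},
  "b": {"author": {
            "email": "jane@doe.com",
            "name": "Jane Doe"
        }}
}'''

# Greedy: one match spanning from the first "author" to the last "}"
greedy = re.findall(r'"author": {.+}', ext, re.DOTALL)

# Non-greedy: each match stops at the first "}" after "author"
lazy = re.findall(r'"author": {.+?}', ext, re.DOTALL)

print(len(greedy), len(lazy))
for match in lazy:
    print(match)
```

Note that the non-greedy version would still cut off early if an author object itself contained a nested `{...}`, which is one more reason a JSON parser is the more robust choice.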

Shiping


Thank you! This works with the cutout of my API response, but not with the full response. – Dr.Zipfeltitte Feb 22 '22 at 09:01
@Dr.Zipfeltitte Can you post the cases that don't work? I tried it and got all cases of "author": {...}. – Shiping Feb 22 '22 at 12:47
