This is my first question!

I have a certain string looking like this (cutout):

"""
"random": [
    {
        "d3rdda": "dasd212dsadsa",
        "author": {
            "email": "john@doe.com",
            "name": "John Doe"
        },
"""

I need to find all authors with the corresponding email addresses, so I want to get every match that starts with

"author": {

and ends with

}

because that would exactly give me

["email": "john@doe.com", "name": "John Doe"]

Sadly, this is not working:

result = re.findall(r'\"author\": {.+}$', ext)
print(result)

I'm very grateful for any support!

  • That is JSON, use `import json`, parse the JSON and find the `author` keys and grab their values. – Wiktor Stribiżew Feb 21 '22 at 22:13
  • The problem is that author is not always nested at the same level – Dr.Zipfeltitte Feb 21 '22 at 22:16
  • It does not matter, JSON is JSON and needs to be parsed with a dedicated library. – Wiktor Stribiżew Feb 21 '22 at 22:18
  • You should also provide a minimal reproducible example that is representative of your issue. It's not difficult to parse a JSON object recursively for all properties named `"author"`. – Pranav Hosangadi Feb 21 '22 at 22:19
  • I agree JSON should not be parsed with regex, but just to say why your regex is not working: the "." does not match newlines, you have to put them explicitly in the capture group using the \n token (and then also maybe \r). – niid Feb 21 '22 at 22:21
  • An example JSON/string would be something like this: https://api.github.com/users/rotki/events/public – Dr.Zipfeltitte Feb 21 '22 at 22:28
  • @niid Correct. Another approach would be to use multiline mode: https://stackoverflow.com/questions/587345/regular-expression-matching-a-multiline-block-of-text – Nick ODell Feb 21 '22 at 22:29
  • @NickODell I was testing it using multiline mode, but the problem is `.` does not match control characters. – niid Feb 21 '22 at 22:44
  • I have never heard of multiline mode before, I will have to dive into it. The tricky part is also the quotation marks in the string/JSON. I am able to find the first match with a lambda function, by finding the first index of the string with the find function and then slicing it. But I think regex would be more appropriate here, and would also give me all the matches, not just one. I still appreciate any help. – Dr.Zipfeltitte Feb 21 '22 at 22:46

3 Answers

This seems to work for this example, but every other author entry would have to have exactly the same layout (the same number of lines):

re.findall(r'\"author\".*\S+\s+.*\S+\s+.*\S+\s+}', ext)
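To illustrate that limitation, here is a small sketch. Both sample strings are made up for the demonstration, and the "date" field is hypothetical: the pattern matches an author object with exactly two fields, but finds nothing once a third field is added.

```python
import re

# The pattern from this answer, unchanged
pattern = r'\"author\".*\S+\s+.*\S+\s+.*\S+\s+}'

# Hypothetical sample: author object with exactly two fields
two_fields = '''
"author": {
    "email": "john@doe.com",
    "name": "John Doe"
},
'''

# Hypothetical sample: the same object with one extra field
three_fields = '''
"author": {
    "email": "john@doe.com",
    "name": "John Doe",
    "date": "2022-02-21"
},
'''

matches_two = re.findall(pattern, two_fields)
matches_three = re.findall(pattern, three_fields)

print(len(matches_two))    # 1: the two-field layout is matched
print(len(matches_three))  # 0: the extra line breaks the pattern
```

Only the three `\s+` groups can cross a line break, so the pattern is tied to that exact number of lines.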
Kirsten_J

This isn't a good application for regex. Instead, you should deserialize it using the json library, and find any dict keys named "author" in the resulting object. This is easy to do using a recursive function:

def find_authors(obj):
    authors = [] # Empty list
    if isinstance(obj, dict): # If obj is a dict, iterate over its keys
        for key in obj:
            if key == "author": # If the key is author, then append it to our return list
                authors.append(obj[key])

            elif isinstance(obj[key], (list, dict)): 
                # Else, if the value is a list or a dict, then look for authors inside it 
                # and extend the original list with the result
                authors.extend(find_authors(obj[key]))

    elif isinstance(obj, list): # Else if it's a list, iterate over its elements
        for elem in obj:
            # Look for authors in each element of the list, and extend the main authors list
            authors.extend(find_authors(elem)) 

    return authors

import urllib.request
import json

r = urllib.request.urlopen("https://api.github.com/users/rotki/events/public")
txt = r.read()
jobj = json.loads(txt)

find_authors(jobj)

This gives a list containing all "author" entries in the JSON. Note that this is an actual Python list containing dictionaries, not a JSON string.

[{'email': 'lefteris@refu.co', 'name': 'Lefteris Karapetsas'},
 {'email': 'lefteris@refu.co', 'name': 'Lefteris Karapetsas'},
 {'email': 'lefteris@refu.co', 'name': 'Lefteris Karapetsas'},
 {'email': 'lefteris@refu.co', 'name': 'Lefteris Karapetsas'},
 {'email': 'lefteris@refu.co', 'name': 'Lefteris Karapetsas'},
 {'email': 'lefteris@refu.co', 'name': 'Lefteris Karapetsas'},
 {'email': 'lefteris@refu.co', 'name': 'Lefteris Karapetsas'},
 {'email': 'lefteris@refu.co', 'name': 'Lefteris Karapetsas'},
 {'email': 'lefteris@refu.co', 'name': 'Lefteris Karapetsas'}]
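If the API is not reachable (behind a firewall, for example), the same idea can be tried offline. The following is a minimal sketch that repeats the function above and runs it on a made-up JSON string; the sample data, including the "payload" and "commits" keys and the Jane Doe entry, is an assumption chosen only to show "author" at two different nesting levels.

```python
import json

def find_authors(obj):
    """Recursively collect the values of every "author" key."""
    authors = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == "author":
                authors.append(value)
            elif isinstance(value, (list, dict)):
                # Look for authors inside nested containers
                authors.extend(find_authors(value))
    elif isinstance(obj, list):
        for elem in obj:
            authors.extend(find_authors(elem))
    return authors

# Hypothetical sample with "author" at two different nesting levels
sample = '''
{
  "random": [
    {
      "d3rdda": "dasd212dsadsa",
      "author": {"email": "john@doe.com", "name": "John Doe"}
    },
    {
      "payload": {
        "commits": [
          {"author": {"email": "jane@doe.com", "name": "Jane Doe"}}
        ]
      }
    }
  ]
}
'''

authors = find_authors(json.loads(sample))
for author in authors:
    print(author["name"], author["email"])
```

Because the function walks the parsed structure, the nesting depth of "author" does not matter.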
Pranav Hosangadi
  • Hi Pranav, thank you very much for your work, I really appreciate it! I like this non-regex approach, but sadly your code returns an empty list for me. Any idea why? – Dr.Zipfeltitte Feb 22 '22 at 09:24
  • @Dr.Zipfeltitte I get `[{'email': 'yabirg@protonmail.com', 'name': 'Yabir Benchakhtir'}, {'email': 'lefteris@refu.co', 'name': 'Lefteris Karapetsas'}]` with the code. – Wiktor Stribiżew Feb 22 '22 at 10:57
  • Thanks! It works now, I guess my firewall was blocking the response. – Dr.Zipfeltitte Feb 22 '22 at 12:45

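You may try this. re.DOTALL lets . match newlines, and the non-greedy .+? stops at the first closing brace instead of running on to the last one in the string.

re.findall(r'"author": \{.+?\}', ext, re.DOTALL)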
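A small sketch comparing three variants of this pattern may help to see what the flag and the quantifier each do. The sample string is hypothetical (two author objects, each spread over several lines), and the variable names are only for illustration.

```python
import re

# Hypothetical sample with two author objects spread over several lines
ext = '''
"random": [
    {
        "author": {
            "email": "john@doe.com",
            "name": "John Doe"
        },
        "id": 1
    },
    {
        "author": {
            "email": "jane@doe.com",
            "name": "Jane Doe"
        },
        "id": 2
    }
]
'''

# Without re.DOTALL, "." cannot cross the line break after "{"
no_flag = re.findall(r'"author": \{.+\}', ext)

# With re.DOTALL but a greedy ".+", the match runs to the last "}"
greedy = re.findall(r'"author": \{.+\}', ext, re.DOTALL)

# With re.DOTALL and a non-greedy ".+?", each match stops at the first "}"
lazy = re.findall(r'"author": \{.+?\}', ext, re.DOTALL)

print(len(no_flag), len(greedy), len(lazy))
```

Note that the non-greedy variant still cuts off too early if an author object itself contains a nested `{...}`, which is one reason the JSON parser is the more robust choice.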
Shiping