
I currently have Python code that parses Markdown text to extract the content inside the square brackets of a Markdown link, along with the hyperlink.

import re

# Extract []() style links
link_name = r"[^]]+"
link_url = r"http[s]?://[^)]+"
markup_regex = rf'\[({link_name})\]\(\s*({link_url})\s*\)'

for match in re.findall(markup_regex, '[a link](https://www.wiki.com/atopic_(subtopic))'):
    name = match[0]
    url = match[1]
    print(url)
    # url will be https://www.wiki.com/atopic_(subtopic

This fails to capture the full link because the match stops at the first closing parenthesis rather than the last one.

How can I make the regex match up to the final closing parenthesis?

Tomerikoo
  • It is confusing. Could you please provide a full [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example)? – Whole Brain Jun 11 '21 at 17:02
  • There's no way to handle that. Remember that `https://www.silly.com/abc))))` is a perfectly valid URL. Users will have to encode them as %29. Even Typora doesn't handle embedded right parens. – Tim Roberts Jun 11 '21 at 17:03
  • @TimRoberts there is no right here, just to be clear. It's ambiguous. Period. – muzzletov Jun 11 '21 at 17:04
  • The only way in this case would be to use a stack and store the amount of open parentheses. But then it may cause issues in other instances as @TimRoberts pointed out. – muzzletov Jun 11 '21 at 17:07
  • It's fairly well known that HTML is not a regular language and hence cannot be parsed by regular expressions — so you will probably need to use a module like beautifulsoup. – martineau Jun 11 '21 at 18:34
  • I have updated with a minimal code example and tried to clarify the question. – James Bradbury Jun 11 '21 at 18:53
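The counting approach muzzletov describes in the comments can be sketched without any regex at all: scan for `](`, then walk forward keeping a depth counter so parentheses nested inside the URL stay balanced. This is a minimal illustration of that idea, not code from the thread; the helper name `extract_links` is my own:

```python
def extract_links(text):
    """Find [name](url) pairs, balancing parentheses inside the url."""
    links = []
    i = 0
    while True:
        start = text.find("[", i)
        if start == -1:
            break
        mid = text.find("](", start)
        if mid == -1:
            break
        name = text[start + 1:mid]
        depth = 1              # the "(" of "](" is already open
        j = mid + 2
        while j < len(text) and depth:
            if text[j] == "(":
                depth += 1
            elif text[j] == ")":
                depth -= 1
            j += 1
        if depth == 0:         # unbalanced urls are simply skipped
            links.append((name, text[mid + 2:j - 1]))
        i = j
    return links

print(extract_links('[a link](https://www.wiki.com/atopic_(subtopic))'))
# [('a link', 'https://www.wiki.com/atopic_(subtopic)')]
```

As Tim Roberts points out above, this still cannot disambiguate a URL that legitimately ends in unencoded `)` characters; no counting scheme can, because the input is ambiguous.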

2 Answers


For those types of URLs, you'd need a recursive pattern, which only the third-party regex module supports (the built-in re module does not):

import regex as re

data = """
It's very easy to make some words **bold** and other words *italic* with Markdown. 
You can even [link to Google!](http://google.com)
[a link](https://www.wiki.com/atopic_(subtopic))
"""

pattern = re.compile(r'\[([^][]+)\](\(((?:[^()]+|(?2))+)\))')

for match in pattern.finditer(data):
    description, _, url = match.groups()
    print(f"{description}: {url}")

This yields

link to Google!: http://google.com
a link: https://www.wiki.com/atopic_(subtopic)

See a demo on regex101.com.


This cryptic little beauty boils down to

\[([^][]+)\]           # capture anything between "[" and "]" into group 1
(\(                    # open group 2 and match "("
    ((?:[^()]+|(?2))+) # match anything not "(" nor ")" or recurse group 2
                       # capture the content into group 3 (the url)
\))                    # match ")" and close group 2

NOTE: The problem with this approach is that it fails for e.g. urls like

[some nasty description](https://google.com/()
#                                          ^^^

which are perfectly valid in Markdown. If you expect to encounter such URLs, use a proper parser instead.

Jan

I think you need to distinguish between what makes a valid link in markdown, and (optionally) what is a valid url. Valid links in markdown can, for example, also be relative paths, and urls may or may not have the 'http(s)' or the 'www' part.

Your code would already work by simply using link_url = "http[s]?://.+" or even link_url = ".*". That solves the problem of URLs ending with parentheses, and simply means you rely on the Markdown structure []() to find links. Validating URLs is an entirely different discussion: How do you validate a URL with a regular expression in Python?

Example code fix:

import re

# Extract []() style links
link_name = r"[^\[]+"
link_url = r"http[s]?://.+"
markup_regex = rf'\[({link_name})\]\(\s*({link_url})\s*\)'

for match in re.findall(markup_regex, '[a link](https://www.wiki.com/atopic_(subtopic))'):
    name = match[0]
    url = match[1]
    print(url)
    # url will be https://www.wiki.com/atopic_(subtopic)

Note that I also adjusted link_name, to prevent problems with a single '[' somewhere in the markdown text.

Christiaan Herrewijn