How to extract links from markdown

Question

I'm trying to parse an input that may either be a hyperlink or a hyperlink in markdown. I can easily check whether it's a hyperlink with ^https?://.+$ and use regexp.Match, but with markdown links it's a whole different rabbit hole for me.

I came across this regex, ^\[([\w\s\d]+)\]\((https?:\/\/[\w\d./?=#]+)\)$, which I tried to modify to match only the markdown link. Having managed to capture the last parentheses, I've been trying to get at just the 2nd capture group, the link, with things like SubexpNames, FindStringIndex, FindSubmatch, Split and so on, but none of them seem to capture what I'm looking for (sometimes they return the entire string anyway), or most likely I'm doing it wrong.

Here's what I'm looking for:

Input - [https://imgur.com/abc](https://imgur.com/bcd)
Should output the link - https://imgur.com/bcd

Here's my code so far: https://play.golang.org/p/OiJE3TvvVb6

Why do you need a regex at all? Just check if the first character is `[`. — Jonathan Hall, Feb 29 '20 at 10:22
Just to check if it's a valid http link, and then preferably I'd just like to parse the link itself from the markdown link — Keanu, Feb 29 '20 at 10:24
Basically I'm taking the input, which may be a hyperlink or a markdown link, then I'm embedding the link somewhere else with ReplaceAll on another string. ``` toEdit = `*[THE LINK TO THE TEMPLATE](%LINK%)*` toEdit = strings.ReplaceAll(toEdit, "%LINK%", link) ``` — Keanu, Feb 29 '20 at 10:27
Do you need to support relative URLs? i.e. `[Foo](/some/path)`? — Jonathan Hall, Feb 29 '20 at 10:29
And do you have the URLs in isolation, or they're embedded in larger text? i.e. is your input literally `[Foo](https://example.com)`, or is it a full document like `Go to [Foo](https://example.com) for a good time`? — Jonathan Hall, Feb 29 '20 at 10:29
The URLs are only in isolation, this is more of a command sort of thing — Keanu, Feb 29 '20 at 10:34
In that case, just extract everything between the parentheses. A regex is probably not the best tool, because it's fairly inefficient, but it should work. Something like `\((.*)\)` should do it. — Jonathan Hall, Feb 29 '20 at 10:35
Keep in mind that this can break if you have parentheses in your descriptions, i.e. `[Foo (bar)](http://example.com/)` would break. This is one reason a regular expression is the wrong tool for a job like this. The best solution would be to use a proper Markdown parser. — Jonathan Hall, Feb 29 '20 at 10:38
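The "extract everything between the parentheses" idea from the comments can also be sketched without a regex. This is only a minimal illustration, assuming the input is a single isolated link as described above; `linkTarget` is a hypothetical helper name, not something from the original code. It splits on the last `](`, so parentheses in the description do not trip it up, although a URL that itself contains `](` would.

```go
package main

import (
	"fmt"
	"strings"
)

// linkTarget is a hypothetical helper. For an isolated Markdown link of the
// form [text](target) it returns the target and true; otherwise "" and false.
// It does not check that the target is an http(s) URL.
func linkTarget(s string) (string, bool) {
	s = strings.TrimSpace(s)
	if !strings.HasPrefix(s, "[") || !strings.HasSuffix(s, ")") {
		return "", false
	}
	// Use the last "](" so that "(...)" inside the description is ignored.
	i := strings.LastIndex(s, "](")
	if i < 0 {
		return "", false
	}
	return s[i+2 : len(s)-1], true
}

func main() {
	target, ok := linkTarget("[Foo (bar)](http://example.com/)")
	fmt.Println(target, ok) // http://example.com/ true
}
```

For anything beyond isolated links, a proper Markdown parser remains the more robust choice, as noted in the comment above.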

score 4 · Accepted Answer · answered Feb 29 '20 at 23:19

You may use regexp.FindStringSubmatch to get the captured value yielded by your single-URL validating regex:

package main

import (
    "fmt"
    "regexp"
)

func main() {
    markdownRegex := regexp.MustCompile(`^\[[^][]+]\((https?://[^()]+)\)$`)
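    results := markdownRegex.FindStringSubmatch("[https://imgur.com/abc](https://imgur.com/bcd)")
    // FindStringSubmatch returns nil when the input does not match,
    // so check before indexing to avoid a panic.
    if results == nil {
        fmt.Println("no match")
        return
    }
    // results[0] is the whole match; results[1] is capture group 1, the URL.
    fmt.Printf("%q", results[1])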
}

See the Go demo online.
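Since the input in the question may be either a bare hyperlink or a Markdown link, the two checks can be combined in one function. The following is a sketch under that assumption: `extractLink` is a hypothetical name, the bare-URL pattern is the one from the question, and the Markdown pattern is the one shown above.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Bare hyperlink, as in the question.
	plainURL = regexp.MustCompile(`^https?://.+$`)
	// Isolated Markdown link; group 1 captures the URL.
	markdownLink = regexp.MustCompile(`^\[[^][]+]\((https?://[^()]+)\)$`)
)

// extractLink is a hypothetical helper that returns the URL from either
// input form, and false when the input matches neither.
func extractLink(input string) (string, bool) {
	if plainURL.MatchString(input) {
		return input, true
	}
	if m := markdownLink.FindStringSubmatch(input); m != nil {
		return m[1], true
	}
	return "", false
}

func main() {
	inputs := []string{
		"https://imgur.com/bcd",
		"[https://imgur.com/abc](https://imgur.com/bcd)",
		"not a link",
	}
	for _, in := range inputs {
		link, ok := extractLink(in)
		fmt.Printf("%q %v\n", link, ok)
	}
}
```

The returned link can then be passed to strings.ReplaceAll to fill the %LINK% placeholder mentioned in the comments.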

You may consider using regexp.FindAllStringSubmatch to find all occurrences of the links you need:

package main

import (
    "fmt"
    "regexp"
)

func main() {
    markdownRegex := regexp.MustCompile(`\[[^][]+]\((https?://[^()]+)\)`)
    results := markdownRegex.FindAllStringSubmatch("[https://imgur.com/abc](https://imgur.com/bcd) and [https://imgur.com/xyy](https://imgur.com/xyz)", -1)
    for _, match := range results {
        fmt.Printf("%q\n", match[1])
    }
}

See the Go demo.

The pattern means (the first version additionally wraps it in ^ and $, so the whole input must be a single link):

\[ - a [ char
[^][]+ - 1+ chars other than [ and ]
]\( - ]( substring
(https?://[^()]+) - Group 1: http, then an optional s, then a :// substring, and then 1+ chars other than ( and )
\) - a ) char.

See the online regex demo.
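To illustrate why the submatch functions can appear to "return the entire string", here is a small sketch of what FindStringSubmatch yields: index 0 is always the whole match, and the capture groups start at index 1. The group name `url` and the helper `matchParts` are assumptions made for this example; SubexpIndex requires Go 1.15 or later.

```go
package main

import (
	"fmt"
	"regexp"
)

// Same pattern as above, with the capture group given the (assumed) name "url".
var namedLink = regexp.MustCompile(`^\[[^][]+]\((?P<url>https?://[^()]+)\)$`)

// matchParts is a hypothetical helper returning the whole match and the
// named "url" group, or ok == false when the input does not match.
func matchParts(s string) (whole, url string, ok bool) {
	m := namedLink.FindStringSubmatch(s)
	if m == nil {
		return "", "", false
	}
	// m[0] is the entire match; SubexpIndex looks the group up by name.
	return m[0], m[namedLink.SubexpIndex("url")], true
}

func main() {
	whole, url, ok := matchParts("[https://imgur.com/abc](https://imgur.com/bcd)")
	fmt.Println(ok)    // true
	fmt.Println(whole) // [https://imgur.com/abc](https://imgur.com/bcd)
	fmt.Println(url)   // https://imgur.com/bcd
}
```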
