How can I write a javascript regular expression to replace hyperlinks in this format []() with html hyperlinks?

Question

I need to parse text containing links in the following formats:

[html title](http://www.htmlpage.com)
http://www.htmlpage.com
https://i.stack.imgur.com/rDDPu.jpg

The output for those three strings would be:

<a href='http://www.htmlpage.com'>html title</a>
<a href='http://www.htmlpage.com'>http://www.htmlpage.com</a>
<a href='https://i.stack.imgur.com/rDDPu.jpg'>https://i.stack.imgur.com/rDDPu.jpg</a>

The string could include an arbitrary number of these links, e.g.:

[html title](http://www.htmlpage.com)[html title](http://www.htmlpage.com)
[html title](http://www.htmlpage.com)   [html title](http://www.htmlpage.com)
[html title](http://www.htmlpage.com) wejwelfj http://www.htmlpage.com

output:

<a href='http://www.htmlpage.com'>html title</a><a href='http://www.htmlpage.com'>html title</a>
<a href='http://www.htmlpage.com'>html title</a>   <a href='http://www.htmlpage.com'>html title</a>
<a href='http://www.htmlpage.com'>html title</a> wejwelfj <a href='http://www.htmlpage.com'>http://www.htmlpage.com</a>

I have an extremely long function that does an alright job by passing over the string 3 times, but I can't successfully parse this string:

[This](http://i.imgur.com/iIlhrEu.jpg) one got me crying first, then once the floodgates were opened [this](http://i.imgur.com/IwSNFVD.jpg) one did it again and [this](http://i.imgur.com/hxIwPKJ.jpg). Ugh, feels. Gotta go hug someone/something.

For brevity, I'll post the regular expressions I've tried rather than the entire find/replace function:

var matchArray2 = inString.match(/\[.*\]\(.*\)/g);

This is meant to match [*](*), but it doesn't work because []()[]() is matched as a single link.
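A minimal sketch of why that happens, assuming a made-up two-link sample string: the greedy `.*` runs to the last `]` and `)` it can find, while the non-greedy `.*?` stops at the first one.

```javascript
// Hypothetical sample input with two adjacent links.
const sample = '[a](http://x.com)[b](http://y.com)';

// Greedy: `.*` backtracks from the end of the string, so both links
// end up inside a single match.
const greedyMatches = sample.match(/\[.*\]\(.*\)/g);

// Non-greedy: `.*?` expands only as far as needed, so each link is
// matched separately.
const lazyMatches = sample.match(/\[.*?\]\(.*?\)/g);

console.log(greedyMatches.length); // 1
console.log(lazyMatches.length);   // 2
```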

Really that's it, I guess. Once I make that match I search that match for () and [] to parse out the link and link text and build the href tag. I delete matches from a temp string so I don't match them when I do my second pass to find plain hyperlinks:

var plainLinkArray = tempString2.match(/http\S*:\/\/\S*/g);

I'm not parsing any html with regex. I'm parsing a string and attempting to output html.

edit: I added the requirement that it parse the third link https://i.stack.imgur.com/rDDPu.jpg after the fact.

my final solution (based on @Cerbrus's answer):

function parseAndHandleHyperlinks(inString)
{
    var result = inString.replace(/\[(.+?)\]\((https?:\/\/.+?)\)/g, '<a href="$2">$1</a>');
    return result.replace(/(?: |^)(https?\:\/\/[a-zA-Z0-9/.(]+)/g, ' <a href="$1">$1</a>');     
}

[What have you tried](http://whathaveyoutried.com)? As many people here will tell you, parsing HTML with regex... that way madness lies, [as you can see here](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). Sure, if that's the only markup you have to deal with, it's possible, but do look into the alternatives — Elias Van Ootegem, Jan 30 '13 at 08:00
I couldn't possibly think of a place where [that](http://stackoverflow.com) would be useful... — jahroy, Jan 30 '13 at 08:12
@jahroy: Have you seen how urls are made on here? Let me give you a hint: `[title](url)` or `[title][1] <....> [1]:url`. Parsers like this are useful on forums and other community sites like that. — Cerbrus, Jan 30 '13 at 08:13
Also, @EliasVanOotegem: there's a difference between trying to interpret a HTML document, and trying to parse one specific format into HTML. — Cerbrus, Jan 30 '13 at 08:16
@cerbrus: You're right, I just saw _regex_, _html_ and _parse_, so I leaped to the wrong conclusion. When I commented, there was no code to show what the OP had tried thus far, however, so I left the comment as is — Elias Van Ootegem, Jan 30 '13 at 08:20
You should check on a Markdown implementation for this. This has already been done. — nhahtdh, Jan 30 '13 at 09:58
@Cerbrus - I was trying to make a funny... note there's a link in my comment. — jahroy, Feb 01 '13 at 00:30
final solution does not work for string like this: (https://example.com/the-new-control-plane/generating-self-signed-certificates-on-windows-7812a600c2d8) — user1892777, Apr 14 '20 at 20:59

score 10 · Accepted Answer · edited Feb 09 '22 at 18:16

Try this regex:

/\[(.+?)\]\((https?:\/\/[a-zA-Z0-9/.(]+?)\)/g

var s = "[html title](http://www.htmlpage.com)[html title](http://www.htmlpage.com)\n\
[html title](http://www.htmlpage.com)   [html title](http://www.htmlpage.com)\n\
[html title](http://www.htmlpage.com) wejwelfj http://www.htmlpage.com";

s.replace(/\[(.+?)\]\((https?:\/\/[a-zA-Z0-9/.(]+?)\)/g, '<a href="$2">$1</a>');

Regex Explanation:

# /                   - Regex Start
# \[                  - a `[` character (escaped)
# (.+?)               - Followed by one or more of any character, grouped, non-greedy, so it won't match past:
# \]                  - a `]` character (escaped)
# \(                  - Followed by a `(` character (escaped)
# (https?:\/\/
#   [a-zA-Z0-9/.(]+?) - Followed by a string that starts with `http://` or `https://` and continues with letters, digits, `/`, `.` or `(`, grouped, non-greedy
# \)                  - Followed by a `)` character (escaped)
# /g                  - End of the regex, search globally.

Now the two strings inside the [] and the () are captured as $1 and $2, and placed in the following string:

'<a href="$2">$1</a>';
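The replacement string drops the captured title and URL into the HTML unchanged. If the input can contain characters like `<` or `"`, a replacer function is one way to escape them first. This is only a sketch: `escapeHtml` and `linkifyBracketed` are hypothetical helper names, not part of the answer, and the pattern is the same one shown above.

```javascript
// Hypothetical helper: escape characters that are special in HTML text
// and in double-quoted attribute values.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// The replacer receives the whole match first, then capture group 1
// (the title) and capture group 2 (the URL).
function linkifyBracketed(input) {
  return input.replace(
    /\[(.+?)\]\((https?:\/\/[a-zA-Z0-9/.(]+?)\)/g,
    (whole, title, url) =>
      '<a href="' + escapeHtml(url) + '">' + escapeHtml(title) + '</a>'
  );
}

console.log(linkifyBracketed('[a <b> c](http://x.com)'));
// <a href="http://x.com">a &lt;b&gt; c</a>
```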

This works for your "problematic" string:

var s = "[This](http://i.imgur.com/iIlhrEu.jpg) one got me crying first, then once the floodgates were opened [this](http://i.imgur.com/IwSNFVD.jpg) one did it again and [this](http://i.imgur.com/hxIwPKJ.jpg). Ugh, feels. Gotta go hug someone/something."
s.replace(/\[(.+?)\]\((https?:\/\/[a-zA-Z0-9/.(]+?)\)/g, '<a href="$2">$1</a>')

// Result:

'<a href="http://i.imgur.com/iIlhrEu.jpg">This</a> one got me crying first, then once the floodgates were opened <a href="http://i.imgur.com/IwSNFVD.jpg">this</a> one did it again and <a href="http://i.imgur.com/hxIwPKJ.jpg">this</a>. Ugh, feels. Gotta go hug someone/something.'

Some more examples with "Incorrect" input:

var s = "[Th][][is](http://x.com)\n\
    [this](http://x(.com)\n\
    [this](http://x).com)"
s.replace(/\[(.+?)\]\((https?:\/\/[a-zA-Z0-9/.(]+?)\)/g, '<a href="$2">$1</a>')

//   "<a href="http://x.com">Th][][is</a>
//    <a href="http://x(.com">this</a>
//    <a href="http://x">this</a>.com)"

You can't really blame the last line for breaking, since there's no way to know if the user meant to stop the url there, or not.

To catch loose urls, add this:

.replace(/(?: |^)(https?\:\/\/[a-zA-Z0-9/.(]+)/g, ' <a href="$1">$1</a>');

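The (?: |^) bit matches a space character or the start of the string, so it'll also match a string starting with a url. Note that the replacement always begins with a space, so a url at the very start of the string gains a leading space.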
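An alternative to chaining two replacements is a single pass with both patterns joined by `|`. The sketch below is an assumption on my part rather than the answer's method: `linkifyAll` is a hypothetical name, and it keeps the same character class, so URLs containing other characters such as `-` are still cut short. Because the string is scanned only once, the URLs inside generated `href` attributes are never re-matched, and no space check is needed. The trade-off is that a bare URL is linked wherever it appears, not only after a space.

```javascript
function linkifyAll(input) {
  // Alternative 1: [title](url) -> capture groups 1 and 2
  // Alternative 2: bare url     -> capture group 3
  const pattern =
    /\[(.+?)\]\((https?:\/\/[a-zA-Z0-9/.(]+?)\)|(https?:\/\/[a-zA-Z0-9/.(]+)/g;

  return input.replace(pattern, (whole, title, url, bareUrl) =>
    // Groups from the alternative that did not match are undefined.
    bareUrl !== undefined
      ? '<a href="' + bareUrl + '">' + bareUrl + '</a>'
      : '<a href="' + url + '">' + title + '</a>'
  );
}

console.log(linkifyAll('[html title](http://www.htmlpage.com) wejwelfj http://www.htmlpage.com'));
// <a href="http://www.htmlpage.com">html title</a> wejwelfj <a href="http://www.htmlpage.com">http://www.htmlpage.com</a>
```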

edited Feb 09 '22 at 18:16 by Julien Le Coupanec
answered Jan 30 '13 at 08:02 by Cerbrus

Yes, to parse the bracketed hrefs. I'm just having a hard time parsing the plain hrefs after doing this replace (since these new hyperlinks are now all matches). @Explosion Pills had a solution, but it used look-behind which Javascript does not support. – BrennanR Jan 30 '13 at 08:53
`[html title](http://www.htmlpage.com) wejwelfj http://www.htmlpage.com` to `html title wejwelfj http://www.htmlpage.com` is not handled. Otherwise the problem is solved. – BrennanR Jan 30 '13 at 08:55
Ah, For the last one, couldn't we just check if there's a space in front of the `http://`? Like this: `s.replace(/(?: |^)(https?\:\/\/(\w|\.)+)/g, ' $1')`. Seems to work for me. – Cerbrus Jan 30 '13 at 08:58
Unfortunately another test case would be : `http://www.htmlpage.com` with no spaces surrounding the link at all. – BrennanR Jan 30 '13 at 09:00
Unfortunately that still doesn't quite do it... this link is not parsed correctly, `http://i.imgur.com/OgQ9Uaf.jpg` The resulting link is http://i.imgur.com, the rest of the text is not captured by the regex. You'll have to pardon me for not adding a test case demonstrating that example. – BrennanR Jan 30 '13 at 09:05
@BrennanR: You may want to mention such extra criteria in the question, next time, indeed. Fixed the regexes. – Cerbrus Jan 30 '13 at 09:09
That works with the addition of 0-9 after a-zA-Z. Thank you very much, your help is much appreciated! You just reduced about 80 lines of javascript (yes it was that bad), that didn't work, to two lines. – BrennanR Jan 30 '13 at 09:15
I'll edit the `0-9` into my answer, forgot about those. Also, did you know you can chain `.replace`? Just like this: `string.replace(regex1, with).replace(regex2, with);` That'd make it 1 line of code :P – Cerbrus Jan 30 '13 at 09:17
Haha, I guess I was aware you could, but stylistically I usually don't do such a thing. Makes me nervous when my lines of code start wrapping the screen :P – BrennanR Jan 30 '13 at 09:22
The downside to this method is that the regex will grab the nearest `[` and, from there, search for the nearest `]` that is followed by an http link in `()`. So in a text like `a[1] will look like [this](http://link.to/picture)`, the hyperlink will cover the whole section `1] will look like [this`. SO's implementation of Markdown correctly hyperlinks the word `this` only. – nhahtdh Jan 30 '13 at 10:09
@nhahtdh: if you have improvements, I'm all ears. – Cerbrus Jan 30 '13 at 10:15

Explosion Pills · Answer 2 · 2013-01-30T15:39:25.773

5

str.replace(/\[(.*?)\]\((.*?)\)/gi, '<a href="$2">$1</a>');

This assumes that there are no errant brackets in the string or parentheses in the URL.

Then:

str.replace(/(\s|^)(https?:\/\/.*?)(?=\s|$)/gi, '$1<a href="$2">$2</a>')

This matches an "http"-like URL that is preceded by whitespace or the start of the string, so a URL immediately preceded by a " (which would have just been added by the previous replacement) is skipped. Feel free to use a better expression if you have it, of course.

EDIT: I edited the answer because I did not realize that JS did not have lookbehind syntax. Instead, you can see that the expression matches any space or the beginning of the line to match plain http links. The captured space has to be put back (hence the $1). A lookahead at the end is done to ensure that everything up to the next space (or end of the expression) is captured. If space is not a good boundary for you, you will have to come up with a better one.
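Put together, the two replacements could be wrapped as in the sketch below. The function name `linkifyTwoPass` is made up for illustration; the two patterns are the ones shown above.

```javascript
function linkifyTwoPass(str) {
  // Pass 1: [title](url) -> anchor tag.
  const bracketed = str.replace(/\[(.*?)\]\((.*?)\)/gi, '<a href="$2">$1</a>');

  // Pass 2: bare URLs. Group 1 is the whitespace (or empty string at the
  // start) in front of the URL and is written back via $1. URLs inside
  // href="..." follow a quote, not whitespace, so they are left alone.
  return bracketed.replace(
    /(\s|^)(https?:\/\/.*?)(?=\s|$)/gi,
    '$1<a href="$2">$2</a>'
  );
}

console.log(linkifyTwoPass('[html title](http://www.htmlpage.com) wejwelfj http://www.htmlpage.com'));
// <a href="http://www.htmlpage.com">html title</a> wejwelfj <a href="http://www.htmlpage.com">http://www.htmlpage.com</a>
```

Since the leading whitespace is captured and restored, a URL at the very start of the string does not gain an extra space.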

edited Jan 30 '13 at 15:39
answered Jan 30 '13 at 08:02 by Explosion Pills

Your first replace will place the title and url in the incorrect locations. – Cerbrus Jan 30 '13 at 08:11
The first regex appears to work. The second one is showing "invalid qualifier" when I use this: var result2 = result.replace(/(?<!")(https?:\/\/.*?)\b/, '$1'); Firefox's Error Console points to the initial / inside the replace function. – BrennanR Jan 30 '13 at 08:20
It appears this doesn't work because javascript does not support "look-behind". – BrennanR Jan 30 '13 at 08:51
This solution is way too loose. – nhahtdh Jan 30 '13 at 10:04
@nhahtdh what do you mean by "too loose?" – Explosion Pills Jan 30 '13 at 15:35
@ExplosionPills: Anything in `()` is turned into link, and link may contain `()` (MSDN, Java reference). – nhahtdh Jan 30 '13 at 16:49
@nhahtdh outside the scope of the question, but it's pretty trivial just to check for `(http` instead of `(.*` – Explosion Pills Jan 30 '13 at 17:01

nhahtdh · Answer 3 · 2013-01-30T16:51:29.123

It seems that you are trying to convert Markdown syntax to HTML. Markdown syntax has yet to have a specification (I am referring to grammar, not behavior specification) for it, so you are going to walk around blindfolded and try to incorporate bug fixes for behavior that you don't want along the way, all of that while reinventing the wheel. I would recommend that you use an existing implementation rather than coding one yourself. For example, Pagedown is a JS implementation of Markdown that is currently used in StackOverflow.

If you still want a regex solution, below is my attempt. Note that I don't know whether it will play well with other features of Markdown as you progress (if you do at all).

/\[((?:[^\[\]\\]|\\.)+)\]\((https?:\/\/(?:[-A-Z0-9+&@#\/%=~_|\[\]](?= *\))|[-A-Z0-9+&@#\/%?=~_|\[\]!:,.;](?! *\))|\([-A-Z0-9+&@#\/%?=~_|\[\]!:,.;(]*\))+) *\)/i

The regex above should capture some part of the behavior of Pagedown for the [description](url) style of linking (title is not supported). I'm not confident it captures everything, since the source code of Pagedown is too complex to read in one go. The regex is combined from two different regexes used in the Pagedown source code.

Some features:

Capturing group 1 contains text inside [] and capturing group 2 contains the URL.
Allow escaping of [ and ] inside the text part [], by using \ e.g. [a\[1\]](http://link.com). You need to do a bit of extra processing, though.
Allow 1 level of () inside link, very useful in cases like this: [String.valueOf](http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#valueOf(double))
Allow space after the link and before the ).
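A short sketch of how the regex might be used. Two things here are assumptions rather than part of the answer: the `g` flag is added so every link is replaced, and `convertLinks` is a hypothetical name. The "extra processing" is taken to mean removing the backslashes from escaped characters in the title.

```javascript
// The regex from above, with the `g` flag added (an assumption) so that
// all links in the text are replaced.
const markdownLink = /\[((?:[^\[\]\\]|\\.)+)\]\((https?:\/\/(?:[-A-Z0-9+&@#\/%=~_|\[\]](?= *\))|[-A-Z0-9+&@#\/%?=~_|\[\]!:,.;](?! *\))|\([-A-Z0-9+&@#\/%?=~_|\[\]!:,.;(]*\))+) *\)/gi;

function convertLinks(text) {
  return text.replace(markdownLink, (whole, title, url) => {
    // Extra processing: turn escapes such as \[ and \] back into [ and ].
    const plainTitle = title.replace(/\\(.)/g, '$1');
    return '<a href="' + url + '">' + plainTitle + '</a>';
  });
}

// Escaped brackets inside the link text.
console.log(convertLinks('[a\\[1\\]](http://link.com)'));
// <a href="http://link.com">a[1]</a>

// Unescaped brackets earlier in the text do not start a link.
console.log(convertLinks('a[1] will look like [this](http://link.to/picture)'));
// a[1] will look like <a href="http://link.to/picture">this</a>
```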

I don't take into account the bare link in this regex.

Reference:

Coding Horror: The Future of Markdown