6

I need to parse text containing links in the following formats:

[html title](http://www.htmlpage.com)
http://www.htmlpage.com
https://i.stack.imgur.com/rDDPu.jpg

The output for those three strings would be:

<a href='http://www.htmlpage.com'>html title</a>
<a href='http://www.htmlpage.com'>http://www.htmlpage.com</a>
<a href='https://i.stack.imgur.com/rDDPu.jpg'>https://i.stack.imgur.com/rDDPu.jpg</a>

The string could include an arbitrary number of these links, e.g.:

[html title](http://www.htmlpage.com)[html title](http://www.htmlpage.com)
[html title](http://www.htmlpage.com)   [html title](http://www.htmlpage.com)
[html title](http://www.htmlpage.com) wejwelfj http://www.htmlpage.com

output:

<a href='http://www.htmlpage.com'>html title</a><a href='http://www.htmlpage.com'>html title</a>
<a href='http://www.htmlpage.com'>html title</a>   <a href='http://www.htmlpage.com'>html title</a>
<a href='http://www.htmlpage.com'>html title</a> wejwelfj <a href='http://www.htmlpage.com'>http://www.htmlpage.com</a>

I have an extremely long function that does an alright job by passing over the string 3 times, but I can't successfully parse this string:

[This](http://i.imgur.com/iIlhrEu.jpg) one got me crying first, then once the floodgates were opened [this](http://i.imgur.com/IwSNFVD.jpg) one did it again and [this](http://i.imgur.com/hxIwPKJ.jpg). Ugh, feels. Gotta go hug someone/something.

For brevity, I'll post the regular expressions I've tried rather than the entire find/replace function:

var matchArray2 = inString.match(/\[.*\]\(.*\)/g);

This is meant to match [*](*), but it doesn't work: .* is greedy, so []()[]() is matched as a single link.
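To make the greedy-matching problem concrete, here is a minimal sketch comparing the expression above with a non-greedy variant. The sample string and the variable names are made up for illustration; they are not from the original function.

```javascript
// Hypothetical input: two bracketed links with nothing between them.
const input = "[a](http://x.com)[b](http://y.com)";

// Greedy: each .* grabs as much as it can, so the whole string is one match.
const greedy = input.match(/\[.*\]\(.*\)/g);

// Non-greedy: each .*? stops at the first delimiter that lets the match succeed.
const lazy = input.match(/\[.*?\]\(.*?\)/g);

console.log(greedy); // [ '[a](http://x.com)[b](http://y.com)' ]
console.log(lazy);   // [ '[a](http://x.com)', '[b](http://y.com)' ]
```

The non-greedy form still assumes there are no stray brackets in the text and no parentheses inside the URL.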

Really that's it, I guess. Once I make that match, I search it for () and [] to parse out the link and link text and build the href tag. I delete matches from a temp string so I don't match them when I do my second pass to find plain hyperlinks:

var plainLinkArray = tempString2.match(/http\S*:\/\/\S*/g);

I'm not parsing any html with regex. I'm parsing a string and attempting to output html.

edit: I added the requirement that it parse the third link https://i.stack.imgur.com/rDDPu.jpg after the fact.

my final solution (based on @Cerbrus's answer):

function parseAndHandleHyperlinks(inString)
{
    var result = inString.replace(/\[(.+?)\]\((https?:\/\/.+?)\)/g, '<a href="$2">$1</a>');
    return result.replace(/(?: |^)(https?\:\/\/[a-zA-Z0-9/.(]+)/g, ' <a href="$1">$1</a>');     
}
BrennanR
  • [What have you tried](http://whathaveyoutried.com)? As many people here will tell you, parsing HTML with regex... that way madness lies, [as you can see here](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). Sure, if this is the only markup you have to deal with, it's possible, but do look into the alternatives – Elias Van Ootegem Jan 30 '13 at 08:00
  • I couldn't possibly think of a place where [that](http://stackoverflow.com) would be useful... – jahroy Jan 30 '13 at 08:12
  • @jahroy: Have you seen how urls are made on here? Let me give you a hint: `[title](url)` or `[title][1] <....> [1]:url`. Parsers like this are useful on forums and other community sites like that. – Cerbrus Jan 30 '13 at 08:13
  • Also, @EliasVanOotegem: there's a difference between trying to interpret a HTML document, and trying to parse one specific format into HTML. – Cerbrus Jan 30 '13 at 08:16
  • @cerbrus: You're right, I just saw _regex_, _html_ and _parse_, so I leaped to the wrong conclusion. When I commented, there was no code to show what the OP had tried thus far, however, so I left the comment as is – Elias Van Ootegem Jan 30 '13 at 08:20
  • You should check on a Markdown implementation for this. This has already been done. – nhahtdh Jan 30 '13 at 09:58
  • @Cerbrus - I was trying to make a funny... note there's a link in my comment. – jahroy Feb 01 '13 at 00:30
  • @jahroy: oh darn, how did I miss that o.O – Cerbrus Feb 01 '13 at 07:07
  • final solution does not work for string like this: (https://example.com/the-new-control-plane/generating-self-signed-certificates-on-windows-7812a600c2d8) – user1892777 Apr 14 '20 at 20:59

3 Answers

10

Try this regex:

/\[(.+?)\]\((https?:\/\/[a-zA-Z0-9/.(]+?)\)/g

var s = "[html title](http://www.htmlpage.com)[html title](http://www.htmlpage.com)\n\
[html title](http://www.htmlpage.com)   [html title](http://www.htmlpage.com)\n\
[html title](http://www.htmlpage.com) wejwelfj http://www.htmlpage.com";

s.replace(/\[(.+?)\]\((https?:\/\/[a-zA-Z0-9/.(]+?)\)/g, '<a href="$2">$1</a>');

Regex Explanation:

# /                   - Regex Start
# \[                  - a `[` character (escaped)
# (.+?)               - Followed by one or more characters (anything but a newline), captured, non-greedy, so it won't match past:
# \]                  - a `]` character (escaped)
# \(                  - Followed by a `(` character (escaped)
# (https?:\/\/
#   [a-zA-Z0-9/.(]+?) - Followed by a captured string that starts with `http://` or `https://` and continues with letters, digits, `/`, `.` or `(` (non-greedy)
# \)                  - Followed by a `)` character (escaped)
# /g                  - End of the regex, search globally.

Now the 2 strings in the () / [] are captured, and placed in the following string:

'<a href="$2">$1</a>';

This works for your "problematic" string:

var s = "[This](http://i.imgur.com/iIlhrEu.jpg) one got me crying first, then once the floodgates were opened [this](http://i.imgur.com/IwSNFVD.jpg) one did it again and [this](http://i.imgur.com/hxIwPKJ.jpg). Ugh, feels. Gotta go hug someone/something."
s.replace(/\[(.+?)\]\((https?:\/\/[a-zA-Z0-9/.(]+?)\)/g, '<a href="$2">$1</a>')

// Result:

'<a href="http://i.imgur.com/iIlhrEu.jpg">This</a> one got me crying first, then once the floodgates were opened <a href="http://i.imgur.com/IwSNFVD.jpg">this</a> one did it again and <a href="http://i.imgur.com/hxIwPKJ.jpg">this</a>. Ugh, feels. Gotta go hug someone/something.'

Some more examples with "Incorrect" input:

var s = "[Th][][is](http://x.com)\n\
    [this](http://x(.com)\n\
    [this](http://x).com)"
s.replace(/\[(.+?)\]\((https?:\/\/[a-zA-Z0-9/.(]+?)\)/g, '<a href="$2">$1</a>')

//   "<a href="http://x.com">Th][][is</a>
//    <a href="http://x(.com">this</a>
//    <a href="http://x">this</a>.com)"

You can't really blame the last line for breaking, since there's no way to know if the user meant to stop the url there, or not.

To catch loose urls, add this:

.replace(/(?: |^)(https?\:\/\/[a-zA-Z0-9/.(]+)/g, ' <a href="$1">$1</a>');

The (?: |^) bit matches either a space character or the start of the string, so it also matches a string that starts with a url.
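One side effect of the replacement above is that it always emits a leading space, even when the url sits at the very start of the string. A possible variation, shown here only as a sketch, captures the boundary and puts it back. The function name `linkify` and the switch from a literal space to `\s` (so that a url at the start of a later line also matches) are assumptions of this example, not part of the answer.

```javascript
// Hypothetical helper combining both passes from the answer above.
function linkify(text) {
  // Pass 1: [title](url) links, using the answer's expression unchanged.
  const bracketed = text.replace(
    /\[(.+?)\]\((https?:\/\/[a-zA-Z0-9/.(]+?)\)/g,
    '<a href="$2">$1</a>'
  );
  // Pass 2: loose urls. Group 1 is the start of the string or one whitespace
  // character; writing it back as $1 keeps the original spacing intact.
  return bracketed.replace(
    /(^|\s)(https?:\/\/[a-zA-Z0-9/.(]+)/g,
    '$1<a href="$2">$2</a>'
  );
}

console.log(linkify("http://www.htmlpage.com"));
console.log(linkify("[html title](http://www.htmlpage.com) wejwelfj http://www.htmlpage.com"));
```

Urls already inside href="..." are skipped because they follow a `"` rather than whitespace. The character class is still the narrow one from the answer, so urls containing characters such as `-` or `?` are cut short.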

Julien Le Coupanec
Cerbrus
  • Yes, to parse the bracketed hrefs. I'm just having a hard time parsing the plain hrefs after doing this replace (since these new hyperlinks are now all matches). @Explosion Pills had a solution, but it used look-behind which Javascript does not support. – BrennanR Jan 30 '13 at 08:53
  • `[html title](http://www.htmlpage.com) wejwelfj http://www.htmlpage.com` to `html title wejwelfj http://www.htmlpage.com` is not handled. Otherwise the problem is solved. – BrennanR Jan 30 '13 at 08:55
  • Ah, For the last one, couldn't we just check if there's a space in front of the `http://`? Like this: `s.replace(/(?: |^)(https?\:\/\/(\w|\.)+)/g, ' $1')`. Seems to work for me. – Cerbrus Jan 30 '13 at 08:58
  • Unfortunately another test case would be : `http://www.htmlpage.com` with no spaces surrounding the link at all. – BrennanR Jan 30 '13 at 09:00
  • Unfortunately that still doesn't quite do it... this link is not parsed correctly, `http://i.imgur.com/OgQ9Uaf.jpg` The resulting link is http://i.imgur.com, the rest of the text is not captured by the regex. You'll have to pardon me for not adding a test case demonstrating that example. – BrennanR Jan 30 '13 at 09:05
  • @BrennanR: You may want to mention such extra criteria in the question, next time, indeed. Fixed the regexes. – Cerbrus Jan 30 '13 at 09:09
  • That works with the addition of 0-9 after a-zA-Z. Thank you very much, your help is much appreciated! You just reduced about 80 lines of javascript (yes it was that bad), that didn't work, to two lines. – BrennanR Jan 30 '13 at 09:15
  • I'll edit the `0-9` into my answer, forgot about those. Also, did you know you can chain `.replace`? Just like this: `string.replace(regex1, with).replace(regex2, with);` That'd make it 1 line of code :P – Cerbrus Jan 30 '13 at 09:17
  • Haha, I guess I was aware you could, but stylistically I usually don't do such a thing. Makes me nervous when my lines of code start wrapping the screen :P – BrennanR Jan 30 '13 at 09:22
  • The downside to this method is that the regex will grab the nearest `[` and, from there, search for the nearest `]` that is followed by an http link in `()`. So a text like `a[1] will look like [this](http://link.to/picture)` will have the hyperlink applied to this section of text: `1] will look like [this`. SO's implementation of Markdown correctly hyperlinks the word `this` only. – nhahtdh Jan 30 '13 at 10:09
  • @nhahtdh: if you have improvements, I'm all ears. – Cerbrus Jan 30 '13 at 10:15
5
str.replace(/\[(.*?)\]\((.*?)\)/gi, '<a href="$2">$1</a>');

This assumes that there are no errant brackets in the string or parentheses in the URL.

Then:

str.replace(/(\s|^)(https?:\/\/.*?)(?=\s|$)/gi, '$1<a href="$2">$2</a>')

This matches an "http"-like URL that follows whitespace or the start of the string. A URL placed inside href="..." by the previous replacement follows a " instead, so it is not matched again. Feel free to use a better expression if you have it, of course.

EDIT: I edited the answer because I did not realize that JS did not have lookbehind syntax. Instead, the expression now matches any whitespace or the beginning of the string in front of a plain http link. The captured whitespace has to be put back (hence the $1). A lookahead at the end ensures that everything up to the next whitespace (or the end of the string) is captured. If whitespace is not a good boundary for you, you will have to come up with a better one.
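Lookbehind assertions were added to JavaScript in ES2018, so on an engine that supports them the original idea (skip any URL that directly follows a quote) can be written directly. The following is only a sketch: the function name `linkifyPlainUrls` and the exact character sets are assumptions of this example, and older engines will reject the pattern with a syntax error.

```javascript
// Hypothetical second pass, run after the [title](url) replacement.
function linkifyPlainUrls(html) {
  // (?<!["'>]) rejects a URL that directly follows a quote or a ">",
  // i.e. one already sitting in an href attribute or used as anchor text.
  // [^\s<]+ takes everything up to the next whitespace or tag start.
  return html.replace(/(?<!["'>])https?:\/\/[^\s<]+/g, '<a href="$&">$&</a>');
}

const afterFirstPass =
  '<a href="http://www.htmlpage.com">html title</a> wejwelfj http://www.htmlpage.com';
console.log(linkifyPlainUrls(afterFirstPass));
```

In the replacement string, `$&` stands for the whole match, so no capturing group is needed. Trailing punctuation such as a closing `)` or `.` is still swallowed into the URL, just as with the whitespace-based version.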

Explosion Pills
3

It seems that you are trying to convert Markdown syntax to HTML. Markdown syntax has yet to have a specification (I am referring to grammar, not behavior specification) for it, so you are going to walk around blindfolded and try to incorporate bug fixes for behavior that you don't want along the way, all of that while reinventing the wheel. I would recommend that you use an existing implementation rather than coding one yourself. For example, Pagedown is a JS implementation of Markdown that is currently used in StackOverflow.

If you still want a regex solution, below is my attempt. Note that I don't know whether it will play well with other features of Markdown as you progress (if you do at all).

/\[((?:[^\[\]\\]|\\.)+)\]\((https?:\/\/(?:[-A-Z0-9+&@#\/%=~_|\[\]](?= *\))|[-A-Z0-9+&@#\/%?=~_|\[\]!:,.;](?! *\))|\([-A-Z0-9+&@#\/%?=~_|\[\]!:,.;(]*\))+) *\)/i

The regex above should capture part of Pagedown's behavior for the [description](url) style of linking (title is not supported). I'm not confident it captures everything, since the source code of Pagedown is too complex to read in one go. The regex is combined from 2 different regexes used in the Pagedown source code.

Some features:

  • Capturing group 1 contains text inside [] and capturing group 2 contains the URL.
  • Allow escaping of [ and ] inside the text part [], by using \ e.g. [a\[1\]](http://link.com). You need to do a bit of extra processing, though.
  • Allow 1 level of () inside link, very useful in cases like this: [String.valueOf](http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#valueOf(double))
  • Allow space after the link and before the ).

I don't take into account the bare link in this regex.
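As a rough illustration of how the regex and the "extra processing" for escaped brackets might fit together, here is a sketch. The pattern is copied from above; the `g` flag, the function name `convertLinks`, and the unescaping step are assumptions of this example, and nothing is HTML-escaped.

```javascript
// Pattern from the answer, with `g` added so every link is replaced.
const markdownLink = /\[((?:[^\[\]\\]|\\.)+)\]\((https?:\/\/(?:[-A-Z0-9+&@#\/%=~_|\[\]](?= *\))|[-A-Z0-9+&@#\/%?=~_|\[\]!:,.;](?! *\))|\([-A-Z0-9+&@#\/%?=~_|\[\]!:,.;(]*\))+) *\)/gi;

// Hypothetical helper: turns [description](url) into an anchor tag.
function convertLinks(text) {
  return text.replace(markdownLink, (match, label, url) => {
    // Extra processing: drop the backslash from escapes such as \[ and \].
    const unescaped = label.replace(/\\(.)/g, "$1");
    return `<a href="${url}">${unescaped}</a>`;
  });
}

console.log(convertLinks("[a\\[1\\]](http://link.com)"));
// <a href="http://link.com">a[1]</a>
console.log(convertLinks("a[1] will look like [this](http://link.to/picture)"));
// a[1] will look like <a href="http://link.to/picture">this</a>
```

Because the text part cannot contain an unescaped `[` or `]`, the stray `[1]` in the second call is left alone and only the word `this` becomes a link.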

nhahtdh