I recently posted an answer with the following link:

https://cache-elastic-pandora.ecn.cl/emol/noticia/_search?q=publicada:true+AND+ultimoMinuto:true+AND+seccion:+AND+temas.id:&sort=fechaModificacion:desc&size=15&from=45

Actual link here:

https://cache-elastic-pandora.ecn.cl/emol/noticia/_search?q=publicada:true+AND+ultimoMinuto:true+AND+seccion:*+AND+temas.id:*&sort=fechaModificacion:desc&size=15&from=45

I was surprised that Stack Overflow isn't able to mark up this hyperlink accurately.

I know this isn't comprehensive (it misses quite a bit), but even a very crude regex that matches up to the next space, with a negative lookbehind to drop trailing punctuation, is able to capture this:

https?:\/\/[^\s]+(?<![,.)\]?!])

https://regex101.com/r/9ZblaL/2/
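
As a quick check of that crude pattern, here is a minimal sketch. It assumes a JavaScript engine with lookbehind support (for example a recent Node.js); the surrounding sentence and the `example.com` URL are made up for illustration.

```javascript
// The crude pattern from the question: match up to whitespace, then
// backtrack until the match no longer ends in trailing punctuation.
const crudeUrlRegex = /https?:\/\/[^\s]+(?<![,.)\]?!])/g;

const link =
  "https://cache-elastic-pandora.ecn.cl/emol/noticia/_search" +
  "?q=publicada:true+AND+ultimoMinuto:true+AND+seccion:*+AND+temas.id:*" +
  "&sort=fechaModificacion:desc&size=15&from=45";

// Hypothetical sentence that ends with a period right after the link.
const text = "The query is " + link + ".";
const found = text.match(crudeUrlRegex);

// Hypothetical second sample with several trailing punctuation marks.
const sample = "(see https://example.com/a?b=1)!";
const sampleFound = sample.match(crudeUrlRegex);

console.log(found[0] === link); // the trailing period is not captured
console.log(sampleFound[0]);
```

The lookbehind only trims punctuation at the very end of the match, so the asterisks in the middle of the query string are kept.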

Does anyone know what regex the Stack Exchange link markup uses? And what might be a better regex for parsing basic web links?

Update: I think the link contains characters (for example, *) that are interpreted as markup and stripped before the link itself is constructed.

David542

1 Answer

The regular expression that SE uses is:

(="|<)?\b(https?|ftp)(:\/\/[-A-Z0-9+&@#\/%?=~_|[\]()!:,.;]*[-A-Z0-9+&@#\/%=~_|[\])])(?=$|\W)

which is constructed from (around line 1530):

    var charInsideUrl = "[-A-Z0-9+&@#/%?=~_|[\\]()!:,.;]",
        charEndingUrl = "[-A-Z0-9+&@#/%=~_|[\\])]",
        autoLinkRegex = new RegExp("(=\"|<)?\\b(https?|ftp)(://" + charInsideUrl + "*" + charEndingUrl + ")(?=$|\\W)", "gi"),

Your URL isn't fully matched because * is not part of the charInsideUrl character set, so the match ends at seccion, just before the :* that follows. Add * to that character set and the pattern matches your entire URL.

Asterisks are officially permitted in query strings (RFC 3986 lists * among the sub-delims), so I don't immediately see anything wrong with just adding them to the character set.
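
To illustrate the difference, here is a minimal sketch that builds the pattern twice, once with the original character set and once with * added. The `buildAutoLinkRegex` helper and the variable names other than `charEndingUrl` are hypothetical and only exist for this comparison; they are not part of the SE source.

```javascript
// Character sets copied from the snippet above; only the "*" in
// patchedInside is new.
const charEndingUrl = "[-A-Z0-9+&@#/%=~_|[\\])]";
const originalInside = "[-A-Z0-9+&@#/%?=~_|[\\]()!:,.;]";
const patchedInside = "[-A-Z0-9+&@#/%?=~_|[\\]()!:,.;*]";

// Hypothetical helper: rebuilds the autolink pattern around a given
// "inside" character set.
function buildAutoLinkRegex(charInsideUrl) {
  return new RegExp(
    "(=\"|<)?\\b(https?|ftp)(://" + charInsideUrl + "*" + charEndingUrl + ")(?=$|\\W)",
    "gi"
  );
}

const url =
  "https://cache-elastic-pandora.ecn.cl/emol/noticia/_search" +
  "?q=publicada:true+AND+ultimoMinuto:true+AND+seccion:*+AND+temas.id:*" +
  "&sort=fechaModificacion:desc&size=15&from=45";

const before = url.match(buildAutoLinkRegex(originalInside))[0];
const after = url.match(buildAutoLinkRegex(patchedInside))[0];

console.log(before);        // stops at "...+AND+seccion"
console.log(after === url); // the patched set covers the whole URL
```

The original match also drops the colon after seccion, because : is allowed inside a URL but not as its last character.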

CertainPerformance
  • Right, I see. I think they might leave out `*` because it takes precedence as markup, for example *italics* (`*italics*`), so I'm guessing it's intentional on their part, but maybe not... – David542 Nov 24 '19 at 05:10
  • I'm pretty sure links are processed *before* italics and bold, see line ~561: `text = _DoAutoLinks(text); text = text.replace(/§P/g, "://"); /* put in place to prevent autolinking; reset now */ text = _EncodeAmpsAndAngles(text); text = _DoItalicsAndBold(text);` where `_DoAutoLinks` turns the substrings that *look* like URLs into true `<a>`s. – CertainPerformance Nov 24 '19 at 05:14
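
The ordering described in the last comment can be sketched roughly as follows. Both functions are simplified, hypothetical stand-ins (not the real `_DoAutoLinks` or `_DoItalicsAndBold`), the italics rule is reduced to a single naive regex, and the `example.com` URL is made up. Under those assumptions it shows how a link that is cut short at the first * leaves a `*...*` pair behind for the italics pass to consume, which would explain the missing asterisks in the question.

```javascript
// Autolink pattern as quoted in the answer.
const autoLinkRegex =
  /(="|<)?\b(https?|ftp)(:\/\/[-A-Z0-9+&@#\/%?=~_|[\]()!:,.;]*[-A-Z0-9+&@#\/%=~_|[\])])(?=$|\W)/gi;

// Simplified stand-in: wraps every autolink match in an anchor tag.
function doAutoLinks(text) {
  return text.replace(autoLinkRegex, (match) => `<a href="${match}">${match}</a>`);
}

// Simplified stand-in: treats any *...* pair as italics.
function doItalics(text) {
  return text.replace(/\*([^*]+)\*/g, "<em>$1</em>");
}

// Hypothetical URL with the same shape as the one in the question.
const input = "https://example.com/_search?q=seccion:*+AND+temas.id:*&size=15";

// Links first, then italics, mirroring the order quoted in the comment.
const html = doItalics(doAutoLinks(input));

console.log(html);
```

In this sketch the anchor covers only the part up to seccion, and the text between the two asterisks ends up inside an `<em>` element with the asterisks themselves removed.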