47

I am using a regular expression to convert plain-text URLs to clickable links.

@(https?://([-\w\.]+)+(:\d+)?(/([\w/_\.-]*(\?\S+)?)?)?)@

However, sometimes in the body of the text, URLs are listed one per line with a semicolon at the end. The real URLs do not contain any ";".

http://www.aaa.org/pressdetail.asp?PRESS_REL_ID=275;
http://www.aaa.org/pressdetail.asp?PRESS_REL_ID=123;
http://www.aaa.org/pressdetail.asp?PRESS_REL_ID=124

Is it permitted to have a semicolon (;) in a URL, or can the semicolon be considered a marker of the end of a URL? How would that fit in my regular expression?
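To make the problem concrete, here is a minimal Python sketch of the pattern above. It assumes the leading and trailing `@` are PHP-style delimiters rather than part of the expression, and the name `QUESTION_RE` is made up for illustration. The `(\?\S+)?` part accepts any non-whitespace character after the `?`, which is why the trailing semicolon ends up inside the match.

```python
import re

# The question's pattern with the "@" delimiters removed (an assumption).
QUESTION_RE = re.compile(r"(https?://([-\w\.]+)+(:\d+)?(/([\w/_\.-]*(\?\S+)?)?)?)")

sample = (
    "http://www.aaa.org/pressdetail.asp?PRESS_REL_ID=275;\n"
    "http://www.aaa.org/pressdetail.asp?PRESS_REL_ID=123;\n"
    "http://www.aaa.org/pressdetail.asp?PRESS_REL_ID=124"
)

# Collect the full URL (group 1) of every match.
matched = [m.group(1) for m in QUESTION_RE.finditer(sample)]

for url in matched:
    print(url)
```

The first two printed URLs still carry the `;`, because `\S+` consumes everything up to the line break.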

vhs
Vincent
  • This is a convoluted question which led everyone to miss the question except for @Alan Moore. The title asks if `;` is valid in a URL, but then the actual "real" url doesn't contain `;`. Yes, `;` is valid in http(s) urls so this is where the crux of the problem starts: How to handle these corrupted http(s) urls with a regex. Unfortunately @Vincent accepted an answer that does not answer the real question. – Zectbumo May 25 '22 at 20:13

7 Answers

49

A semicolon is reserved and should only be used for its special purpose (which depends on the scheme).

Section 2.2 of RFC 1738:

Many URL schemes reserve certain characters for a special meaning: their appearance in the scheme-specific part of the URL has a designated semantics. If the character corresponding to an octet is reserved in a scheme, the octet must be encoded. The characters ";", "/", "?", ":", "@", "=" and "&" are the characters which may be reserved for special meaning within a scheme. No other characters may be reserved within a scheme.
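As a small illustration of "the octet must be encoded", the sketch below percent-encodes a literal semicolon with Python's standard `urllib.parse.quote`. The sample value `"275;"` is only an example: it shows what a URL would look like if the semicolon were really meant as data and not as a reserved character.

```python
from urllib.parse import quote, unquote

# A value that contains a literal semicolon as data (illustrative only).
raw_value = "275;"

# safe="" means no reserved character is left unencoded.
encoded_value = quote(raw_value, safe="")
decoded_value = unquote(encoded_value)

print(encoded_value)   # 275%3B
print(decoded_value)   # 275;
```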

vhs
Greg
  • I'm late to the party, but this code deals explicitly with http/https URLs, which allow ; as the query string separator (instead of &)... actually, Ben already covered that. – Powerlord Feb 22 '10 at 20:30
  • This conditional answer is accurate and misleading and does not address the actual question about the regex. The regex in question is http(s) and yes ';' is allowed in http(s) urls. In fact, it was once recommended by the w3c to use `;` instead of `&` because http urls were mostly going to be used in html and html escape sequences start with the `&` character. – Zectbumo May 25 '22 at 20:21
34

The W3C encourages CGI programs to accept ; as well as & in query strings (i.e. treat ?name=fred&age=50 and ?name=fred;age=50 the same way). This is supposed to be because & has to be encoded as &amp; in HTML whereas ; doesn't.
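A minimal sketch of that recommendation in Python, assuming the server side is free to choose how it splits the query string. The helper name `parse_query_lenient` is hypothetical. Recent versions of `urllib.parse.parse_qsl` split on `&` only by default, so the sketch normalises `;` to `&` first.

```python
from urllib.parse import parse_qsl

def parse_query_lenient(query):
    """Parse a query string, treating ';' and '&' as equivalent separators."""
    # A semicolon that is meant as data must arrive percent-encoded (%3B);
    # it is only decoded after the string has been split into pairs.
    return dict(parse_qsl(query.replace(";", "&")))

print(parse_query_lenient("name=fred&age=50"))
print(parse_query_lenient("name=fred;age=50"))
```

Under this reading, the trailing semicolon in the question's URLs would be an empty separator, which the parser skips.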

Dylan Beattie
15

The semi-colon is a legal URI character; it belongs to the sub-delimiter category: http://www.ietf.org/rfc/rfc3986.txt

However, the specification states that whether the semicolon is legitimate for a specific URI or not depends on the scheme or producer of that URI. So, if the site using those links doesn't allow semicolons, then they're not valid for that particular case.
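For reference, here is a short Python sketch of how a general-purpose URL parser treats the semicolon. `SUB_DELIMS` is the sub-delimiter set as listed in RFC 3986, the `example.com` URL is made up, and the second URL is the one from the question. The parser accepts the semicolon in both positions; deciding what it means is left to the scheme or the site.

```python
from urllib.parse import urlparse, urlsplit

# sub-delims as listed in RFC 3986, section 2.2
SUB_DELIMS = "!$&'()*+,;="

# urlparse splits ";params" off the last path segment (illustrative URL).
with_params = urlparse("http://example.com/doc;v=2?x=1")
print(with_params.path, with_params.params, with_params.query)

# urlsplit keeps a trailing semicolon as part of the query string.
question_url = urlsplit("http://www.aaa.org/pressdetail.asp?PRESS_REL_ID=275;")
print(question_url.query)
```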

Zectbumo
  • Your interpretation of the specification is not accurate. The spec doesn't say "legitimate... or not". It says "safe to be used by scheme-specific and producer-specific algorithms". There is no "not" in the spec. – Zectbumo May 25 '22 at 20:40
8

Technically, a semicolon is a legal sub-delimiter in a URL string; plenty of source material is quoted above, including http://www.ietf.org/rfc/rfc3986.txt.

And some do use it for legitimate purposes, though its use is likely site-specific (i.e., only for use with that site) because its usage has to be defined by the site using it.

In the real world however, the primary use for semicolons in URLs is to hide a virus or phishing URL behind a legitimate URL.

For example, sending someone an email with this link:

http:// www.yahoo.com/junk/nonsense;0200.0xfe.0x37.0xbf/malicious_file/

will result in the Yahoo! link (www.yahoo.com/junk/nonsense) being ignored because, even though it is legitimate (i.e., properly formed), no such page exists. But the second link (0200.0xfe.0x37.0xbf/malicious_file/) presumably exists* and the user will be directed to the malicious_file page, whereupon one's corporate IT manager will get a report and one will likely get a pink slip.

And before all the nay-sayers get their dander up, this is exactly how the new Facebook phishing problem works. The names have been changed to protect the guilty as usual.

*No such page actually exists to my knowledge. The link shown is for purposes of this discussion only.

PROGRAM_IX
No Spam
  • Which app opens `0200.0xfe.0x37.0xbf` because it knows the yahoo link will return a 404 status?! Does not make sense to me. – mgutt Feb 17 '17 at 12:32
6

http://www.ietf.org/rfc/rfc3986.txt covers URLs and what characters may appear in unencoded form. Given that URLs containing semicolons work properly in browsers, your code should support them.

EricLaw
5

Yes, semicolons are valid in URLs. However, if you're plucking them from relatively unstructured prose, it's probably safe to assume a semicolon at the end of a URL is meant as sentence punctuation. The same goes for other sentence-punctuation characters like periods, question marks, quotes, etc.

If you're only interested in URLs with an explicit http[s] protocol, and your regex flavor supports lookbehinds, this regex should suffice:

https?://[\w!#$%&'()*+,./:;=?@\[\]-]+(?<![!,.?;:"'()-])

After the protocol, it simply matches one or more characters that may be valid in a URL, without worrying about structure at all. But then it backs off as many positions as necessary until the final character is not something that might be sentence punctuation.
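Here is a rough Python sketch of that pattern applied to the URLs from the question. Python's `re` module supports the fixed-width lookbehind used here. The names `URL_RE` and `linkify` are made up for illustration, the `re.ASCII` flag is an added assumption to keep `\w` from matching non-ASCII letters, and the surrounding text is assumed to be HTML-safe already.

```python
import html
import re

# The regex from this answer, unchanged; re.ASCII is an added assumption.
URL_RE = re.compile(
    r"""https?://[\w!#$%&'()*+,./:;=?@\[\]-]+(?<![!,.?;:"'()-])""",
    re.ASCII,
)

def linkify(text):
    """Wrap every matched URL in an anchor tag; other text is left untouched."""
    def to_anchor(match):
        url = html.escape(match.group(0), quote=True)
        return f'<a href="{url}">{url}</a>'
    return URL_RE.sub(to_anchor, text)

sample = (
    "http://www.aaa.org/pressdetail.asp?PRESS_REL_ID=275;\n"
    "http://www.aaa.org/pressdetail.asp?PRESS_REL_ID=123;\n"
    "http://www.aaa.org/pressdetail.asp?PRESS_REL_ID=124"
)

found = URL_RE.findall(sample)
print(found)
print(linkify("See http://a.org/x?id=1;"))
```

A semicolon in the middle of a URL is still matched; only trailing punctuation is dropped, so it stays outside the generated link.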

Alan Moore
1

Quoting RFCs is not all that helpful in answering this question, because you will encounter URLs with semicolons (and commas for that matter). We had a Regex that did not handle semicolons and commas, and some of our users at NutshellMail complained because URLs containing them do in fact exist in the wild. Try building a dummy URL in Facebook or Twitter that contains a ';' or ',' and you will see that those two services encode the full URL properly.

I replaced the Regex we were using with the following pattern (and have tested that it works):

 string regex = @"((www\.|(http|https|ftp|news|file)+\:\/\/)[&#95;.a-zA-Z0-9-]+\.[a-zA-Z0-9\/&#95;:@=.+?,##%&~_-]*[^.|\'|\# |!|\(|?|,| |>|<|;|\)])";

This Regex came from http://rickyrosario.com/blog/converting-a-url-into-a-link-in-csharp-using-regular-expressions/ (with a slight modification)

  • 7
    I added code formatting so we could read it more easily, but I don't recommend using that regex. Leaving aside the obvious web mangling and the many redundant backslashes and pipes, the final two character classes are seriously flawed. Not only do they exclude valid characters like semicolons and parentheses, that last one matches all kinds of *invalid* characters like quotation marks, braces, and non-ASCII characters. – Alan Moore Feb 16 '10 at 07:47