8

I currently do automatic detection of hyperlinks within text in my program. I made it very simple and only look for http:// or www.

However, a user suggested to me that I extend it to other forms, e.g.: https:// or .com

Then I realized it might not stop there because there's ftp and mailto and file, all the other top level domains, and even email addresses and file paths.

What I think is best is to limit it to what is practical by following some often-used standard set of hyperlink detection rules that are currently in use. Maybe how Microsoft Word does it, or maybe how RichEdit does it or maybe you know of a better standard.

So my question is:

Is there a built-in function that I can call from Delphi to do the detection, and if so, what would the call look like? (I plan to move to FireMonkey in the future, so I would prefer something that works beyond Windows.)

If there isn't a function available, is there some place I can find a documented set of rules of what is detected in Word, in RichEdit, or any other set of rules of what should be detected? That would then allow me to write the detection code myself.

Ilmari Karonen
lkessler
  • I very much doubt there is a "standard" out there, only "what do various MS Office products like Word, Excel, and Outlook do". Since it's open source, if you can read C++, I would look at how Mozilla Thunderbird does it. – Warren P Jan 23 '12 at 14:46

3 Answers

7

Try the PathIsURL function, which is declared in the ShLwApi unit.

RRUZ
  • That won't do the whole job when the path is embedded within other text. – Rob McDonell Jan 23 '12 at 06:17
  • This wouldn't be too bad if I checked each word (delimited by spaces or other non-URL characters) longer than, say, 5 characters within my text. – lkessler Jan 23 '12 at 06:40
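
The word-by-word idea from the comment above could be sketched roughly as follows. Python is used here only for brevity. `looks_like_url` is a hypothetical stand-in for a per-token check such as PathIsURL and is not equivalent to it; `find_url_tokens`, `MIN_LEN`, and the set of trailing punctuation that gets stripped are likewise assumptions made for illustration.

```python
from urllib.parse import urlsplit

# "longer than, say, 5 characters" from the comment above (an assumed cutoff)
MIN_LEN = 6


def looks_like_url(token: str) -> bool:
    """Hypothetical per-token check; NOT equivalent to PathIsURL."""
    try:
        parts = urlsplit(token)
    except ValueError:
        # urlsplit rejects some malformed inputs, e.g. a broken IPv6 host
        return False
    return bool(parts.scheme) and "://" in token


def find_url_tokens(text: str) -> list:
    """Split on whitespace and keep the tokens that pass the check."""
    hits = []
    for token in text.split():
        # Assumption: punctuation right after a URL belongs to the sentence
        token = token.rstrip(".,;:!?)\"'")
        if len(token) >= MIN_LEN and looks_like_url(token):
            hits.append(token)
    return hits


sample = "You can download RegexBuddy at http://www.regexbuddy.com/download.html."
print(find_url_tokens(sample))
```

Splitting only on whitespace is the simplest choice; a real implementation would probably also need to split on quotes, brackets, and other characters that cannot appear in a URL.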
3

The following regex, taken from RegexBuddy's library, might get you started (I can't make any claims about its performance).

Regex

Match; JGsoft; case insensitive:  
\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]

Explanation

URL: Find in full text

The final character class makes sure that if a URL is part of some text, punctuation such as a comma or full stop after the URL is not interpreted as part of the URL.

Matches (whole or partial)

http://regexbuddy.com
http://www.regexbuddy.com 
http://www.regexbuddy.com/ 
http://www.regexbuddy.com/index.html 
http://www.regexbuddy.com/index.html?source=library 
You can download RegexBuddy at http://www.regexbuddy.com/download.html.

Does not match

regexbuddy.com
www.regexbuddy.com
"www.domain.com/quoted URL with spaces"
support@regexbuddy.com
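
To see how the pattern behaves on the samples above, here is a minimal sketch in Python (used only for brevity; the pattern itself is the one quoted above). The names `URL_IN_TEXT` and `find_urls` are made up for this example, and `re.IGNORECASE` stands in for the "case insensitive" option.

```python
import re

# The "find in full text" pattern quoted above, compiled case-insensitively
URL_IN_TEXT = re.compile(
    r"\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]",
    re.IGNORECASE,
)


def find_urls(text: str) -> list:
    """Return every substring of text that the pattern matches."""
    # finditer + group(0) yields the whole match; findall would return only
    # the scheme, because the pattern contains a capturing group.
    return [m.group(0) for m in URL_IN_TEXT.finditer(text)]


samples = [
    "You can download RegexBuddy at http://www.regexbuddy.com/download.html.",
    "www.regexbuddy.com",
    "support@regexbuddy.com",
]
for sample in samples:
    print(repr(sample), "->", find_urls(sample))
```

Because the last character class excludes the full stop, the trailing "." of the first sample is left out of the match, while the last two samples produce no match at all.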

For a set of rules, you might look into RFC 3986:

A Uniform Resource Identifier (URI) is a compact sequence of
characters that identifies an abstract or physical resource. This
specification defines the generic URI syntax and a process for
resolving URI references that might be in relative form, along with
guidelines and security considerations for the use of URIs on the
Internet.

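A regex that validates a URL as specified in RFC 3986 would be the following (it relies on free-spacing mode for the comments and whitespace, and on case-insensitive matching because it lists only lowercase letters):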

^
(# Scheme
 [a-z][a-z0-9+\-.]*:
 (# Authority & path
  //
  ([a-z0-9\-._~%!$&'()*+,;=]+@)?              # User
  ([a-z0-9\-._~%]+                            # Named host
  |\[[a-f0-9:.]+\]                            # IPv6 host
  |\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\])  # IPvFuture host
  (:[0-9]+)?                                  # Port
  (/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?          # Path
 |# Path without authority
  (/?[a-z0-9\-._~%!$&'()*+,;=:@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?)?
 )
|# Relative URL (no scheme or authority)
 ([a-z0-9\-._~%!$&'()*+,;=@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?  # Relative path
 |(/[a-z0-9\-._~%!$&'()*+,;=:@]+)+/?)                            # Absolute path
)
# Query
(\?[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
# Fragment
(\#[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
$
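
As a rough check of the pattern above, it can be compiled in Python with `re.VERBOSE` (free-spacing) and `re.IGNORECASE`. This is only a sketch; the names `RFC3986` and `is_valid_uri` are made up for the example. Note that it validates a whole string rather than finding URLs inside text, and since relative references are valid under RFC 3986, a bare `regexbuddy.com` passes too.

```python
import re

# The RFC 3986 pattern quoted above, unchanged; the "#" comments inside it
# are regex comments, enabled by re.VERBOSE.
RFC3986 = re.compile(r"""
^
(# Scheme
 [a-z][a-z0-9+\-.]*:
 (# Authority & path
  //
  ([a-z0-9\-._~%!$&'()*+,;=]+@)?              # User
  ([a-z0-9\-._~%]+                            # Named host
  |\[[a-f0-9:.]+\]                            # IPv6 host
  |\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\])  # IPvFuture host
  (:[0-9]+)?                                  # Port
  (/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?          # Path
 |# Path without authority
  (/?[a-z0-9\-._~%!$&'()*+,;=:@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?)?
 )
|# Relative URL (no scheme or authority)
 ([a-z0-9\-._~%!$&'()*+,;=@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?  # Relative path
 |(/[a-z0-9\-._~%!$&'()*+,;=:@]+)+/?)                            # Absolute path
)
# Query
(\?[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
# Fragment
(\#[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
$
""", re.VERBOSE | re.IGNORECASE)


def is_valid_uri(candidate: str) -> bool:
    """True if the whole candidate string matches the pattern."""
    return RFC3986.match(candidate) is not None


for candidate in [
    "http://www.regexbuddy.com/index.html?source=library",
    "mailto:support@regexbuddy.com",
    "regexbuddy.com",
    "not a url",
]:
    print(candidate, "->", is_valid_uri(candidate))
```

For detection inside running text, each candidate token would first have to be isolated, and a filter stricter than "valid URI reference" is probably needed, since almost any word without spaces qualifies as a relative reference.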
Lieven Keersmaekers
1

Regular expressions may be the way to go here: you can define a pattern for each form of hyperlink that you consider appropriate.

Rob McDonell
  • I have seen various implementations of regular expressions to do this, but how do I determine which ones are a "standard set"? My other concern is how efficient they are, since I've got big files to process. – lkessler Jan 23 '12 at 06:28
  • Use regular expressions *especially* if you're concerned about performance. The RegEx language can express what you're looking for very nicely, and the RegEx compiler will turn that into something very efficient. For complex expressions it's definitely faster and easier to maintain than hand-coded parsers. – Cosmin Prund Jan 23 '12 at 07:48
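
For the large-file concern raised in these comments, one common arrangement is to compile the pattern once and scan the input line by line. The sketch below shows that in Python, under the assumption that a URL never spans a line break; `URL_IN_TEXT` reuses the "find in full text" pattern from the second answer, and `scan_lines` is a name made up for this example. It makes no claim about actual throughput.

```python
import re

# Compiled once at import time and reused for every line
URL_IN_TEXT = re.compile(
    r"\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]",
    re.IGNORECASE,
)


def scan_lines(lines):
    """Yield (line number, column, url) for each match in an iterable of lines."""
    for line_no, line in enumerate(lines, start=1):
        for m in URL_IN_TEXT.finditer(line):
            yield line_no, m.start(), m.group(0)


text = ["no links here", "see http://www.regexbuddy.com/ now"]
for hit in scan_lines(text):
    print(hit)
```

Because `scan_lines` is a generator and accepts any iterable of strings, an open file object can be passed in directly, so the whole file never has to be held in memory.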