1

I would like to extract the URLs a user has typed into a piece of text (I assume that every URL starts with http://). This is my first attempt:

Pattern pattern = Pattern.compile("http://[^ ]+");

but if the user types something like this:

"look at somepage (http://somepage.net)"
"look at http://somepage1.net, http://somepage2.net and sth else"
"Please visit our page http://somepage.net."

the matched URL ends with an incorrect(?) character. How can I avoid this?

bltc
  • possible duplicate of [Extracting URLs from a text document using Java + Regular Expressions](http://stackoverflow.com/questions/1806017/extracting-urls-from-a-text-document-using-java-regular-expressions) – Joel Jan 28 '11 at 12:16
  • possible duplicate of [Java-How to detect the presence of URL in a string.](http://stackoverflow.com/questions/285619/java-how-to-detect-the-presence-of-url-in-a-string) – dogbane Jan 28 '11 at 12:17
  • @Joel OK, that seems to work well: http://stackoverflow.com/questions/1806017/extracting-urls-from-a-text-document-using-java-regular-expressions/1806161#1806161 but I don't understand the pattern, and I hope that it is fast. – bltc Jan 28 '11 at 12:27
  • @Joel Unfortunately, that pattern does not catch URLs containing national-specific characters, and modifying it would be a rather hard task :) – bltc Jan 28 '11 at 12:42

2 Answers

0

You can match on the assumption that a URL cannot end with characters such as [,.)] and may end only with [A-Za-z] or /, but this breaks URLs with unusual endings, such as http://site.com/read.php?key=F#$.)
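
The trade-off described here can be seen in a minimal sketch. It assumes the simplest form of the idea: a greedy run of non-space characters that is forced to end in a letter or a slash. The class name `EndCharUrlMatcher` and the method `find` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EndCharUrlMatcher {
    // "http://", then as many non-space characters as possible,
    // backtracking until the last matched character is a letter or '/'.
    private static final Pattern URL = Pattern.compile("http://[^ ]*[A-Za-z/]");

    // Returns every match found in the text, in order of appearance.
    static List<String> find(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }
}
```

With this pattern, trailing commas, periods and closing parentheses are dropped, but a URL such as http://site.com/read.php?key=F#$.) is cut back to http://site.com/read.php?key=F, which is exactly the breakage mentioned above.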

kolko
  • You can ignore that case, I think. To keep it simple: "http://[^ ]+(?<=[A-Za-z0-9#])". Many regexes have already been written for this; just google it. – kolko Jan 28 '11 at 12:48
  • The idea of Stack Overflow is exactly the opposite of sending people to google something. – Prix Jul 02 '11 at 21:17
0

The answer is that you cannot do this with 100% accuracy.

A URL like "http://somepage1.net," is technically legal, and there is no way of knowing for sure whether the "," is part of the URL or just punctuation.

A URL like "http://somepage1.net or something" is technically illegal, but typical end users don't know this. (They are used to browsers that do all sorts of funky things with what they type into the address bar.)

Probably the best you can do is use a regex to extract legal URLs, and then trim text punctuation characters from the right end of each URL, on the assumption that they are not intended to be part of it.
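
A minimal sketch of that extract-then-trim approach follows. It is only an illustration: the candidate pattern `http://\S+` is deliberately loose rather than a full URL grammar, and the set of characters treated as trailing punctuation, along with the names `UrlExtractor` and `extractUrls`, are assumptions you would adjust to your own input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlExtractor {
    private static final String PREFIX = "http://";
    // Loose candidate match: the prefix followed by any non-whitespace run.
    private static final Pattern CANDIDATE = Pattern.compile("http://\\S+");
    // Assumed punctuation set: characters unlikely to be meant as the end of a URL.
    private static final String TRAILING = ".,;:!?)]}'\"";

    static List<String> extractUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            String url = m.group();
            int end = url.length();
            // Strip punctuation from the right end, but never eat into the prefix.
            while (end > PREFIX.length() && TRAILING.indexOf(url.charAt(end - 1)) >= 0) {
                end--;
            }
            // Skip candidates that were nothing but prefix plus punctuation.
            if (end > PREFIX.length()) {
                urls.add(url.substring(0, end));
            }
        }
        return urls;
    }
}
```

Calling `UrlExtractor.extractUrls("Please visit our page http://somepage.net.")` yields a list containing only `http://somepage.net`. The cost is that a URL which genuinely ends in one of the trimmed characters will be shortened.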

You could also treat matching quotes or left / right brackets as denoting URL boundaries; e.g.

    The secret URL is "http://example.com/?" ... don't leave off the "?"
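
One way to sketch the delimiter idea is a pattern with alternatives, tried in order: a URL inside double quotes, a URL inside parentheses, and finally a bare URL up to the next whitespace. The class name `DelimitedUrlExtractor`, the method `extract`, and the choice of these two delimiter pairs are assumptions for illustration; the bare case is left untrimmed here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DelimitedUrlExtractor {
    // Group 1: URL wrapped in double quotes.
    // Group 2: URL wrapped in parentheses.
    // Group 3: bare URL, running to the next whitespace.
    private static final Pattern URL = Pattern.compile(
        "\"(http://[^\"\\s]+)\"|\\((http://[^)\\s]+)\\)|(http://\\S+)");

    static List<String> extract(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            // Exactly one of the three groups is non-null for each match.
            for (int g = 1; g <= 3; g++) {
                if (m.group(g) != null) {
                    urls.add(m.group(g));
                    break;
                }
            }
        }
        return urls;
    }
}
```

For the quoted example above, this keeps the trailing "?" because the closing quote, not punctuation trimming, decides where the URL ends.
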
Stephen C