Extracting URLs from a text document using Java + Regular Expressions

Question

I'm trying to create a regular expression to extract URLs from text documents using Java, but thus far I've been unsuccessful. The two cases I'm looking to capture are listed below:

URLs that start with http://
URLs that start with www. (missing the protocol from the front)

along with the query string parameters.

Thanks! I really wish I knew regular expressions better.

Cheers,

If the text documents are written by humans, you might find things like example.com, with punctuation immediately after the URL. Do you want an accepted answer to handle this, or is this not relevant? — Mark Byers, Nov 26 '09 at 22:54
You haven't accepted any answer to this question. Are none of the solutions suitable for you? What's the problem? — Mark Byers, Nov 27 '09 at 21:54

score 27 · Answer 1 · answered Nov 26 '09 at 23:48

If you want to make sure you are really matching a URL address and not just some word starting with 'www.', you can use the expression mentioned by DVK. I modified it slightly and wrote a small code snippet as a starting point for you:

import java.util.*;
import java.util.regex.*;

class FindUrls
{
    public static List<String> extractUrls(String input) {
        List<String> result = new ArrayList<String>();

        // Scheme or "www." prefix, optional user:password@, host ending in a
        // listed TLD or two-letter code, optional port, path, query, fragment.
        Pattern pattern = Pattern.compile(
            "\\b(((ht|f)tp(s?)\\:\\/\\/|~\\/|\\/)|www\\.)" +
            "(\\w+:\\w+@)?(([-\\w]+\\.)+(com|org|net|gov" +
            "|mil|biz|info|mobi|name|aero|jobs|museum" +
            "|travel|[a-z]{2}))(:[\\d]{1,5})?" +
            "(((\\/([-\\w~!$+|.,=]|%[a-f\\d]{2})+)+|\\/)+|\\?|#)?" +
            "((\\?([-\\w~!$+|.,*:]|%[a-f\\d]{2})+=?" +
            "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)" +
            "(&(?:[-\\w~!$+|.,*:]|%[a-f\\d]{2})+=?" +
            "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)*)*" +
            "(#([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)?\\b");

        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group());
        }

        return result;
    }
}

If you don't mind it picking up email addresses, you can replace the authority portion (\\w+:\\w+@)? with (\\w+(:\\w+)?@)?. If you don't want it to pick up email addresses, you'd need to add some other checks. — GreenKiwi, Feb 07 '12 at 20:41

score 5 · Answer 2 · answered Jan 17 '13 at 17:47

All RegEx-based code is over-engineered, especially the code from the most-voted answer, and here is why: it will find only valid URLs! For example, it will ignore anything that starts with "http://" but has non-ASCII characters inside.

Even more: I have encountered processing times of 1-2 seconds (single-threaded, dedicated) with the Java RegEx package for very small and simple sentences, nothing specific; possibly a bug in the Java 6 RegEx implementation...

The simplest and fastest solution would be to use StringTokenizer to split the text into tokens, remove the tokens starting with "http://" etc., and concatenate the remaining tokens into text again.
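A minimal sketch of the tokenizer idea follows. Note that it collects the matching tokens instead of removing them, since the question asks for extraction. The class name TokenUrlFinder, the method name findUrlTokens, and the "https://" prefix are assumptions added for illustration; they are not part of the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenUrlFinder {
    // Returns every whitespace-delimited token that starts with a URL-like prefix.
    static List<String> findUrlTokens(String text) {
        List<String> urls = new ArrayList<>();
        // With no delimiter argument, StringTokenizer splits on space, tab,
        // newline, carriage return and form feed.
        StringTokenizer tokens = new StringTokenizer(text);
        while (tokens.hasMoreTokens()) {
            String token = tokens.nextToken();
            if (token.startsWith("http://")
                    || token.startsWith("https://")
                    || token.startsWith("www.")) {
                urls.add(token);
            }
        }
        return urls;
    }
}
```

For the input "See http://example.com/a?b=c and www.example.org today" this returns the two tokens http://example.com/a?b=c and www.example.org. Punctuation glued to the end of a URL (a trailing comma or period) stays in the token, so a real document would probably need an extra trimming step.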

If you really want to use RegEx with Java, try Automaton

Indeed, it is. Sometimes you only need basic parsing, and although the OP wanted a regex, this was the answer that saved me. Thank you. — Henrique de Sousa, Apr 08 '13 at 22:28

score 3 · Answer 3 · answered Nov 26 '09 at 22:55

This link has very good URL RegExes (they are surprisingly hard to get right, by the way: think http/https, port numbers, valid characters, GET strings, pound signs for anchor links, etc.):

http://flanders.co.nz/2009/11/08/a-good-url-regular-expression-repost/

Perl has CPAN libraries that contain canned RegExes, including ones for URLs. Not sure about Java though :(

edited Nov 26 '09 at 23:00 by DVK

score 1 · Answer 4 · answered Nov 26 '09 at 23:00

This tests whether a given line is a URL:

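import java.util.regex.Matcher;
import java.util.regex.Pattern;

// matches() succeeds only if the whole line matches the pattern
Pattern p = Pattern.compile("http://.*|www\\..*");
Matcher m = p.matcher("http://..."); // put here the line you want to check
if (m.matches()) {
    // do something
}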
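
Since matches() only tells you whether the entire line is a URL, it will not pull URLs out of the middle of a sentence. A rough sketch of the find() variant is below. The class name LineUrlScanner, the method name findInLine, and the simplified pattern (including the optional "s" in https) are assumptions for illustration, not part of the snippet above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineUrlScanner {
    // \S+ stops at the next whitespace, whereas .* would swallow the rest of the line.
    private static final Pattern URL_START =
            Pattern.compile("\\b(?:https?://|www\\.)\\S+");

    // Returns every substring of the line that starts with http://, https:// or www.
    static List<String> findInLine(String line) {
        List<String> found = new ArrayList<>();
        Matcher m = URL_START.matcher(line);
        while (m.find()) { // find() scans for the next match anywhere in the input
            found.add(m.group());
        }
        return found;
    }
}
```

For the line "Visit www.example.com or http://example.org/x?y=1 now" this returns www.example.com and http://example.org/x?y=1. It does no validation at all, so it is closer in spirit to the snippet above than to the long expression in the top answer.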


jutky

