Java Regex URL Matching

Question

I have a simple regular expression that matches some URLs. It works fine, but I'd like to refine it so that it excludes any URL containing a certain word.

My pattern: (http:[A-z0-9./~%]+)

For example:

http://maps.google.com/maps
http://www.google.com/flights/gwsredirect
http://slav0nic.org.ua/static/books/python/
http://webcache.googleusercontent.com/search
http://www.python.org/ftp/python/doc/
http://webcache.googleusercontent.com/search
http://www.python.org/ftp/python/

Given the list of URLs above matched by my pattern, I'd like to refine the pattern to exclude any URL containing a given word, for example google.

I tried using non-capturing groups but was unsuccessful; maybe I'm missing something.

ADDITIONAL INFORMATION

Maybe my description wasn't clear.

I have a file of data fetched from a URL, and I use the pattern above to extract the list of links shown. As you can see, the pattern returns every link, which is more than I want. I'd like to refine it so that it does not return links containing a certain word, e.g. google.

Thus after I parse the data instead of returning the list of links above it would instead return the following:

http://slav0nic.org.ua/static/books/python/
http://www.python.org/ftp/python/doc/
http://www.python.org/ftp/python/

All help is appreciated, thank you!

You can use the String contains method in Java after verifying with the regex — Ashish, Jan 06 '12 at 09:06
See this question - http://stackoverflow.com/questions/406230/regular-expression-to-match-string-not-containing-a-word — Manish, Jan 06 '12 at 09:10

Has QUIT--Anony-Mousse · Accepted Answer · 2012-01-06T11:25:31.527

2

Try this:

(http:(?![^"\s]*google)[^"\s]+)["\s]

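The key difference from the solutions posted earlier is that I bound the length of the match, which matters when searching: both the lookahead and the match itself stop at the next double quote or whitespace character.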
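A minimal sketch of how this pattern might be applied when searching a block of text, assuming the URLs are delimited by double quotes or whitespace. The class name `UrlExtractor`, the method `extract`, and the sample input are made up for illustration and are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class UrlExtractor {
    // The lookahead scans only up to the next quote or whitespace, so
    // "google" in a neighbouring URL cannot reject the current one.
    private static final Pattern URL =
        Pattern.compile("(http:(?![^\"\\s]*google)[^\"\\s]+)[\"\\s]");

    // Collects group 1 (the URL without its trailing delimiter) for every hit.
    public static List<String> extract(final String text) {
        final List<String> found = new ArrayList<>();
        final Matcher m = URL.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(final String[] args) {
        // Hypothetical sample input, not the asker's real data.
        final String text =
            "<a href=\"http://maps.google.com/maps\">maps</a>\n"
            + "<a href=\"http://www.python.org/ftp/python/\">python</a>\n"
            + "see http://slav0nic.org.ua/static/books/python/ for books\n";
        for (final String url : extract(text)) {
            System.out.println(url);
        }
    }
}
```

Note that the trailing `["\s]` requires a delimiter after the URL, so a URL sitting at the very end of the input with nothing after it would not be matched by this pattern as written.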

edited Jan 06 '12 at 11:25

answered Jan 06 '12 at 11:19


score 1 · Answer 2 · edited May 23 '17 at 12:07

Try this:

(http:(?!.*google).*)

Source: similar questions

EDIT: (this works, tested it)

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTest {

    public static void main( String[] args ) {

        // Negative lookahead: reject the input if "google" appears anywhere after "http:".
        final Pattern p = Pattern.compile( "(http:(?!.*google).*)" );
        final String[] in = new String[]{
            "http://maps.google.com/maps",
            "http://www.google.com/flights/gwsredirect",
            "http://slav0nic.org.ua/static/books/python/",
            "http://webcache.googleusercontent.com/search",
            "http://www.python.org/ftp/python/doc/",
            "http://webcache.googleusercontent.com/search",
            "http://www.python.org/ftp/python/",
        };

        // Each input string holds exactly one URL.
        for ( final String s : in ) {
            final Matcher m = p.matcher( s );
            System.out.print( s );
            if ( m.find() ) {
                System.out.println( " true" );
            } else {
                System.out.println( " false" );
            }
        }
    }
}

OUTPUT:

http://maps.google.com/maps false
http://www.google.com/flights/gwsredirect false
http://slav0nic.org.ua/static/books/python/ true
http://webcache.googleusercontent.com/search false
http://www.python.org/ftp/python/doc/ true
http://webcache.googleusercontent.com/search false
http://www.python.org/ftp/python/ true

Pay attention to the difference of *matching* and *searching*! — Has QUIT--Anony-Mousse, Jan 06 '12 at 11:16
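The distinction in this comment can be illustrated with a small sketch that runs both patterns from this thread against a line containing more than one URL. The class name `SearchVsMatch`, the helper `firstHit`, and the input line are hypothetical; only the two regular expressions come from the answers above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class SearchVsMatch {
    // Answer 2's pattern: the lookahead and the match both run to end of line.
    static final Pattern UNBOUNDED = Pattern.compile("(http:(?!.*google).*)");
    // Accepted answer's pattern: lookahead and match stop at a quote or whitespace.
    static final Pattern BOUNDED =
        Pattern.compile("(http:(?![^\"\\s]*google)[^\"\\s]+)[\"\\s]");

    // Returns group 1 of the first hit found by searching, or null if none.
    public static String firstHit(final Pattern p, final String text) {
        final Matcher m = p.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(final String[] args) {
        // Hypothetical line holding two URLs, as scraped data might.
        final String line =
            "http://www.python.org/ftp/python/ http://maps.google.com/maps ";
        System.out.println(firstHit(UNBOUNDED, line));
        System.out.println(firstHit(BOUNDED, line));
    }
}
```

On this input the unbounded pattern finds nothing (it prints `null`), because `.*google` in the lookahead sees the later google URL on the same line, while the bounded pattern returns `http://www.python.org/ftp/python/`. When each input string holds exactly one URL, as in the test above, the two patterns agree on which URLs to keep.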

fge · Answer 3 · 2012-01-06T10:46:24.683

Modify your regex to capture the hostname and use .contains():

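import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class TestMatch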
{
    private static final List<String> urls = Arrays.asList(
        "http://maps.google.com/maps",
        "http://www.google.com/flights/gwsredirect",
        "http://slav0nic.org.ua/static/books/python/",
        "http://webcache.googleusercontent.com/search",
        "http://www.python.org/ftp/python/doc/",
        "http://webcache.googleusercontent.com/search",
        "http://www.python.org/ftp/python/"
    );

    private static final Pattern p
        = Pattern.compile("^http://([^/]+)/");

    private static final int TRIES = 50000;

    public static void main(final String... args)
    {
        for (final String url: urls)
            System.out.printf("%s: %b\n", url, regexIsOK(url));

        long start, end;

        start = System.currentTimeMillis();
        for (int i = 0; i < TRIES; i++)
            for (final String url: urls)
                regexIsOK(url);
        end = System.currentTimeMillis();

        System.out.println("Time taken: " + (end - start) + " ms");
        System.exit(0);
    }

    private static boolean regexIsOK(final String url)
    {
        final Matcher m = p.matcher(url);

        return m.find() && !m.group(1).contains("google");
    }
}

Sample output:

http://maps.google.com/maps: false
http://www.google.com/flights/gwsredirect: false
http://slav0nic.org.ua/static/books/python/: true
http://webcache.googleusercontent.com/search: false
http://www.python.org/ftp/python/doc/: true
http://webcache.googleusercontent.com/search: false
http://www.python.org/ftp/python/: true
Time taken: 258 ms

I'm sorry, this isn't what I'm looking for; if I do it this way I will be doing more work than I need to. The list of URLs isn't known in advance. I use a regex to extract them, but my pattern returns more than I want, so I want to refine that exact pattern so that it does not return URLs containing a certain word. — Sinista, Jan 06 '12 at 09:49
As you wish, but then why not just use `.contains()` after you have matched your regex (which does allow invalid URLs BTW -- URI will allow you to detect that)? — fge, Jan 06 '12 at 09:59
But that will cause a huge overhead, going through each URL one by one. What do you think will happen if, let's say, I have 10,000 or more URLs? It must be a one-shot thing: the regex must return exactly what I want from the match. — Sinista, Jan 06 '12 at 10:11
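For comparison, the approach suggested in the comments (match first, then call `.contains()`) does not need a second pass over the results: the check can sit inside the same `find()` loop that extracts the links. A rough sketch, assuming the asker's original pattern is used unchanged; the class name `FilteredLinks`, the method `extract`, and the `banned` parameter are invented for this example. The `java.net.URI` constructor is used only as a syntax check, as hinted at in fge's comment.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class FilteredLinks {
    // The asker's pattern, copied verbatim from the question.
    private static final Pattern LINK = Pattern.compile("(http:[A-z0-9./~%]+)");

    // Extracts links in a single scan and drops those containing `banned`.
    public static List<String> extract(final String text, final String banned) {
        final List<String> kept = new ArrayList<>();
        final Matcher m = LINK.matcher(text);
        while (m.find()) {
            final String url = m.group(1);
            if (url.contains(banned)) {
                continue; // filtered during extraction, no second pass
            }
            try {
                new URI(url); // throws if the text is not a syntactically valid URI
                kept.add(url);
            } catch (URISyntaxException e) {
                // Skip text the loose pattern matched but URI rejects.
            }
        }
        return kept;
    }

    public static void main(final String[] args) {
        // Hypothetical input; real data would come from the fetched file.
        final String text = "http://maps.google.com/maps "
            + "http://www.python.org/ftp/python/ "
            + "http://webcache.googleusercontent.com/search";
        System.out.println(extract(text, "google"));
    }
}
```

Whether this is faster or slower than a lookahead-based pattern is not something this sketch measures; it only shows that the filtering step does not require iterating over the URL list a second time.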
