3

I have a simple regular expression that matches URLs, and it works fine. However, I'd like to refine it so that it excludes any URL containing a certain word.

My pattern: (http:[A-z0-9./~%]+)

For example, it matches:

http://maps.google.com/maps
http://www.google.com/flights/gwsredirect
http://slav0nic.org.ua/static/books/python/
http://webcache.googleusercontent.com/search
http://www.python.org/ftp/python/doc/
http://webcache.googleusercontent.com/search
http://www.python.org/ftp/python/

Given the list of URLs above matched by my pattern, I'd like to refine the pattern to exclude URLs containing a given word, for example google.

I tried using non-capturing groups but was unsuccessful; maybe I'm missing something.

ADDITIONAL INFORMATION

Maybe my description wasn't clear.

I have a file of data grabbed from a URL, and I use the pattern above to extract the list of links shown. As you can see, the pattern returns every link, which is more than I want. I want to refine it so that it does not return links containing a certain word, e.g. google.

Thus, after I parse the data, instead of returning the list of links above, it would return the following:

http://slav0nic.org.ua/static/books/python/
http://www.python.org/ftp/python/doc/
http://www.python.org/ftp/python/

All help is appreciated, thank you!

Sinista

3 Answers

2

Try this:

(http:(?![^"\s]*google)[^"\s]+)["\s]

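The key difference from the solutions posted earlier is that the lookahead is bounded: it scans only up to the next quote or whitespace character, so it checks the current URL rather than the rest of the input.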
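
To illustrate the bounded lookahead, here is a minimal sketch that runs the pattern over one line holding several links. The class name `BoundedLookaheadDemo`, the helper `extract`, and the sample HTML are assumptions made for illustration. Note that the trailing `["\s]` means each URL must be followed by a quote or whitespace to match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundedLookaheadDemo {
    // Pattern from the answer above; the lookahead stops at the next quote or whitespace.
    private static final Pattern URL =
        Pattern.compile("(http:(?![^\"\\s]*google)[^\"\\s]+)[\"\\s]");

    // Hypothetical helper: collects group 1 of every match found in the text.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // Assumed input: several quoted links on a single line.
        String text = "<a href=\"http://www.python.org/ftp/python/\">a</a> "
                    + "<a href=\"http://maps.google.com/maps\">b</a> "
                    + "<a href=\"http://slav0nic.org.ua/static/books/python/\">c</a>";
        // prints [http://www.python.org/ftp/python/, http://slav0nic.org.ua/static/books/python/]
        System.out.println(extract(text));
    }
}
```

Because the lookahead cannot run past the closing quote, the google link in the middle does not affect the links on either side of it.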

Has QUIT--Anony-Mousse
1

Try this:

(http:(?!.*google).*)

Source: similar questions

EDIT: (this works, tested it)

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExcludeGoogle {

    public static void main( String[] args ) {

        // Negative lookahead: reject the match if "google" appears anywhere after "http:".
        final Pattern p = Pattern.compile( "(http:(?!.*google).*)" );
        final String[] in = new String[]{
            "http://maps.google.com/maps",
            "http://www.google.com/flights/gwsredirect",
            "http://slav0nic.org.ua/static/books/python/",
            "http://webcache.googleusercontent.com/search",
            "http://www.python.org/ftp/python/doc/",
            "http://webcache.googleusercontent.com/search",
            "http://www.python.org/ftp/python/",
        };

        // Print each URL followed by whether the pattern accepts it.
        for ( final String s : in ) {
            final Matcher m = p.matcher( s );
            System.out.print( s );
            if ( m.find() ) {
                System.out.println( " true" );
            } else {
                System.out.println( " false" );
            }
        }
    }
}

OUTPUT:

http://maps.google.com/maps false
http://www.google.com/flights/gwsredirect false
http://slav0nic.org.ua/static/books/python/ true
http://webcache.googleusercontent.com/search false
http://www.python.org/ftp/python/doc/ true
http://webcache.googleusercontent.com/search false
http://www.python.org/ftp/python/ true
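
One caveat, sketched under the assumption that a single input line may hold more than one URL: the `.*` inside the lookahead scans to the end of the line, so a google link later on the same line also rejects an earlier, unrelated link. The class name `UnboundedLookaheadDemo`, the helper `matches`, and the two sample strings are illustrative only.

```java
import java.util.regex.Pattern;

public class UnboundedLookaheadDemo {
    // Same pattern as in the answer above.
    private static final Pattern P = Pattern.compile("(http:(?!.*google).*)");

    // Hypothetical helper: true if the pattern matches anywhere in the line.
    static boolean matches(String line) {
        return P.matcher(line).find();
    }

    public static void main(String[] args) {
        String oneUrl = "http://www.python.org/ftp/python/";
        // Assumed input: a wanted link followed by a google link on the same line.
        String twoUrls = "http://www.python.org/ftp/python/ http://maps.google.com/maps";

        System.out.println(matches(oneUrl));   // prints true
        System.out.println(matches(twoUrls));  // prints false: the python.org link is lost too
    }
}
```

With one URL per input string, as in the test above, this does not arise.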
mana
0

Modify your regex to capture the hostname and use .contains():

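import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class TestMatch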
{
    private static final List<String> urls = Arrays.asList(
        "http://maps.google.com/maps",
        "http://www.google.com/flights/gwsredirect",
        "http://slav0nic.org.ua/static/books/python/",
        "http://webcache.googleusercontent.com/search",
        "http://www.python.org/ftp/python/doc/",
        "http://webcache.googleusercontent.com/search",
        "http://www.python.org/ftp/python/"
    );

    private static final Pattern p
        = Pattern.compile("^http://([^/]+)/");

    private static final int TRIES = 50000;

    public static void main(final String... args)
    {
        for (final String url: urls)
            System.out.printf("%s: %b\n", url, regexIsOK(url));

        long start, end;

        start = System.currentTimeMillis();
        for (int i = 0; i < TRIES; i++)
            for (final String url: urls)
                regexIsOK(url);
        end = System.currentTimeMillis();

        System.out.println("Time taken: " + (end - start) + " ms");
        System.exit(0);
    }

    private static boolean regexIsOK(final String url)
    {
        final Matcher m = p.matcher(url);

        return m.find() && !m.group(1).contains("google");
    }
}

Sample output:

http://maps.google.com/maps: false
http://www.google.com/flights/gwsredirect: false
http://slav0nic.org.ua/static/books/python/: true
http://webcache.googleusercontent.com/search: false
http://www.python.org/ftp/python/doc/: true
http://webcache.googleusercontent.com/search: false
http://www.python.org/ftp/python/: true
Time taken: 258 ms
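
The same `.contains()` check can also be applied while the links are being extracted from a larger text, so no separate pass over a finished list is needed. The following is a rough sketch, assuming the links are pulled out with the pattern from the question; the class name `ExtractThenFilter`, the method `linksWithout`, and the sample text are hypothetical. Unlike `regexIsOK` above, it tests the whole link rather than only the hostname.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractThenFilter {
    // Pattern copied unchanged from the question. Note that [A-z] also admits
    // the few punctuation characters that sit between 'Z' and 'a' in ASCII.
    private static final Pattern LINK = Pattern.compile("(http:[A-z0-9./~%]+)");

    // Hypothetical helper: returns every matched link that does not contain the given word.
    static List<String> linksWithout(String text, String word) {
        List<String> kept = new ArrayList<>();
        Matcher m = LINK.matcher(text);
        while (m.find()) {
            String link = m.group(1);
            if (!link.contains(word)) {
                kept.add(link);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed input: free text with links mixed in.
        String text = "see http://maps.google.com/maps and http://www.python.org/ftp/python/doc/ "
                    + "or http://webcache.googleusercontent.com/search";
        // prints [http://www.python.org/ftp/python/doc/]
        System.out.println(linksWithout(text, "google"));
    }
}
```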
fge
  • I'm sorry, this isn't what I'm looking for; if I do it this way I will be doing more work than I need to. The list of URLs isn't known in advance. I use a regex to get them, but my pattern returns more than I want, so I want to refine that exact pattern so that it does not return URLs containing a certain word. – Sinista Jan 06 '12 at 09:49
  • As you wish, but then why not just use `.contains()` after you have matched your regex (which does allow invalid URLs BTW -- URI will allow you to detect that)? – fge Jan 06 '12 at 09:59
  • But that will cause a huge overhead, going through each URL one by one. What do you think will happen if, let's say, I have 10,000 or more URLs? It must be a one-shot thing: the regex must return exactly what I want from the match. – Sinista Jan 06 '12 at 10:11
  • OK, look at the solution -- cheap, isn't it? – fge Jan 06 '12 at 10:35