I am creating a simple utility to retrieve all HTTP URLs from a web page.

Initially I planned to use an HTML parsing library to extract the href attributes, but I found out that I also need to retrieve URLs contained inside scripts (example script below). So I started trying a regular expression to get all the HTTP URLs from the web page, but for some reason my regular expression is not working properly.

The URL can be inside JavaScript, for example:

<script> 
    if(jQuery.browser.msie) 
    { 
        var v= 'http://test.com/test/test'; 
    } 
</script> 

My program:

try {
    BufferedReader in = new BufferedReader(new FileReader("c:\\sample\\sample.html"));
    String inputLine;
    while ((inputLine = in.readLine()) != null) {
        System.out.println(inputLine);
        String pattern = "http?://([-\\w\\.]+)+(:\\d+)?(/([\\w/_\\.]*(\\?\\S+)?)?)?";

        // Create a Pattern object
        Pattern r = Pattern.compile(pattern);
        // Now create matcher object.
        Matcher m = r.matcher(inputLine.replaceAll("http://", "\nhttp://"));
        while (!m.hitEnd()) {
            if (m.find()) {
                System.out.println("Found value: " + m.group(0));
            } else {
                //System.out.println("NO MATCH");
            }
        }
    }
    in.close();
} catch (Exception e) {
    e.printStackTrace();
}

Can someone help me fix this issue, or let me know the best way to retrieve all URLs from a web page?

Learner

2 Answers

Description

Your expression has a typo: http? makes the p optional, so it matches htt:// and http:// but not https://. It should make the s optional instead:

https?://([-\\w\\.]+)+(:\\d+)?(/([\\w/_\\.]*(\\?\\S+)?)?)?
    ^
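
To see the difference concretely, here is a minimal Java sketch that checks only the scheme part of both patterns. The class name SchemeTypoDemo and the example.com URLs are illustrative assumptions, not something from the question.

```java
import java.util.regex.Pattern;

class SchemeTypoDemo {
    // Returns true if the regex matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // "http?" means "htt" followed by an optional "p".
        System.out.println(found("http?://", "htt://example.com"));    // true
        System.out.println(found("http?://", "https://example.com"));  // false
        // "https?" means "http" followed by an optional "s".
        System.out.println(found("https?://", "http://example.com"));  // true
        System.out.println(found("https?://", "https://example.com")); // true
    }
}
```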

Also I recommend:

  • replacing the (...) capture groups with non-capturing groups like (?:...)
  • not escaping the . inside a character class, since [.] already matches a literal dot
  • adding a test (a lookahead) to ensure you're not capturing the closing quote surrounding your URL
  • rewriting the part that looks for /folder/subfolder sections as a repeating non-capturing group that matches the initial slash followed by the folder name

regex: https?:\/\/(?:[\w-]+.)+(?::\d+)?(?:\/[\w\/_.]*)*?(?:\?\S+)?(?=['"\s])

as a Java string: "https?:\\/\\/(?:[\\w-]+.)+(?::\\d+)?(?:\\/[\\w\\/_.]*)*?(?:\\?\\S+)?(?=['\"\\s])"

Example

Sample Text

<script> 
    if(jQuery.browser.msie) 
    { 
        var v= 'http://test.com/test/test'; 
    } 
</script> 
<a class="test" href="http://blablablablabla.com">Third Link</a>

Matches

[0] => http://test.com/test/test
[1] => http://blablablablabla.com
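
The sample above can be reproduced with a short Java sketch along these lines. The class name UrlExtractor and the method extract are hypothetical names chosen for illustration. The pattern is copied from this answer as given, including the unescaped . after [\w-]+, which matches any character rather than only a literal dot. Treat it as a starting point under those assumptions, not a complete URL grammar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlExtractor {
    // Pattern from the answer, copied as given; compiled once and reused.
    private static final Pattern URL = Pattern.compile(
        "https?:\\/\\/(?:[\\w-]+.)+(?::\\d+)?(?:\\/[\\w\\/_.]*)*?(?:\\?\\S+)?(?=['\"\\s])");

    // Collects every match in the text, in order of appearance.
    static List<String> extract(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
            "<script> ",
            "    if(jQuery.browser.msie) ",
            "    { ",
            "        var v= 'http://test.com/test/test'; ",
            "    } ",
            "</script> ",
            "<a class=\"test\" href=\"http://blablablablabla.com\">Third Link</a>");
        for (String url : extract(sample)) {
            System.out.println(url);
        }
    }
}
```

Because the lookahead requires a quote or whitespace right after the URL, a URL sitting at the very end of the input with nothing after it would not be matched. Running the pattern over the whole file contents rather than over lines stripped by readLine() avoids most of those cases.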
Ro Yo Mi

Try using this:

\A'http:\/\/[\w\W]+'\z

This checks that the string starts with a quote followed by http:// and ends with a closing quote. Since almost anything can appear inside a URL nowadays, the part in between allows every character, including special characters like ?:,-_/\ as well as digits.

Applied to each quoted string in the file, this should get you the URLs it contains.
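
One thing worth checking before relying on this: \A and \z anchor the pattern to the very start and end of the input, so it only matches when the whole input is a single quoted URL. A minimal Java sketch of that behaviour (the class name AnchoredPatternDemo is an illustrative assumption):

```java
import java.util.regex.Pattern;

class AnchoredPatternDemo {
    // The pattern from this answer, written as a Java string.
    static final Pattern ANCHORED = Pattern.compile("\\A'http:\\/\\/[\\w\\W]+'\\z");

    // Returns true if the anchored pattern matches the input.
    static boolean found(String input) {
        return ANCHORED.matcher(input).find();
    }

    public static void main(String[] args) {
        // The whole input is one quoted URL, so the anchors are satisfied.
        System.out.println(found("'http://test.com/test/test'"));          // true
        // Extra text before the quote means \A cannot match there.
        System.out.println(found("var v= 'http://test.com/test/test';")); // false
    }
}
```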

dirtydexter