
I wrote a regular expression to search for web URLs in a text (full code below), but when I run the code, the console prints only the last URL in the text. I don't know what's wrong, since I am using a while loop. Please see the code below and help me correct it. Thanks.

import java.util.*;
import java.util.regex.*;

public class Main
{
    static String query = "This is a URL http://facebook.com" 
    + " and this is another, http://twitter.com "
    + "this is the last URL http://instagram.com"
    + " all these URLs should be printed after the code execution";

    public static void main(String args[])
    {
        String pattern = "([\\w \\W]*)((http://)([\\w \\W]+)(.com))";
        Pattern p = Pattern.compile(pattern);
        Matcher m = p.matcher(query);

        while(m.find())
        {
             System.out.println(m.group(2));
        }
    }
}

Running the above code prints only http://instagram.com to the console.

Skeellz
  • `[\\w \\W]*` eats up *a lot* of characters, in this case everything before `http://instagram.com`. What did you mean to achieve with that part? – Biffen Apr 12 '16 at 14:06
  • @Biffen "[\\w \\W]*" is what I use to tell the compiler that there may be a few characters before each "http://"... What do you think? Thanks in advance – Skeellz Apr 12 '16 at 14:15
  • What you call ‘*a few characters*’ will be treated by regex as ‘as many characters as possible’. If you don't want to capture them, just remove that part entirely. – Biffen Apr 12 '16 at 14:16
  • Removing that part gets no result – Skeellz Apr 12 '16 at 14:21
  • And using (^(http://)) will only get a result when "http://" begins the text as in - "http://twitter.com is a site" – Skeellz Apr 12 '16 at 14:22
  • Er, yes. Who said anything about `^`?! Removing the first group should be absolutely fine. If it doesn't work you should post that code. – Biffen Apr 12 '16 at 14:23

5 Answers


I found another regex elsewhere:

https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)

It also matches https, and it works for your case.

All three URLs are printed with this code:

import java.util.regex.*;

public class Main {

    static String query = "This is a URL http://facebook.com"
            + " and this is another, http://twitter.com "
            + "this is the last URL http://instagram.com"
            + " all these URLs should be printed after the code execution";

    public static void main(String[] args) {
        // https? matches both http and https; (www\.)? makes the www prefix optional
        String pattern = "https?:\\/\\/(www\\.)?[-a-zA-Z0-9@:%._\\+~#=]{2,256}\\.[a-z]{2,6}\\b([-a-zA-Z0-9@:%_\\+.~#?&//=]*)";
        Pattern p = Pattern.compile(pattern);
        Matcher m = p.matcher(query);

        while (m.find()) {
            // group() with no argument returns the whole match, i.e. the full URL
            System.out.println(m.group());
        }
    }
}
i23

I'm not sure how reliable this pattern is, but it prints out all the URLs when I run your example.

(http://[A-Za-z0-9]+\\.[a-zA-Z]{2,3})

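You will have to modify it if you encounter a URL with more than one dot in the host name, such as:

http://www.instagram.com

For that input the pattern matches only http://www.ins, because it allows a single dot followed by two or three letters.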
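One possible modification, sketched below, is to allow one or more dot-terminated labels before the final suffix. The pattern and the class name `WwwUrlDemo` are my own assumptions for illustration, not something from the question, and the pattern is still far from a complete URL matcher.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WwwUrlDemo {
    // Hypothetical pattern: "http://", then one or more labels each ending in a dot
    // (e.g. "www." and "instagram."), then a suffix of at least two letters.
    static final Pattern URL = Pattern.compile("http://(?:[A-Za-z0-9-]+\\.)+[a-zA-Z]{2,}");

    // Returns every substring of text that the pattern matches, in order.
    static List<String> find(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [http://www.instagram.com, http://facebook.com]
        System.out.println(find("Visit http://www.instagram.com or http://facebook.com today"));
    }
}
```

This handles hosts with or without 'www', but it ignores https, ports, paths and query strings, so treat it as a starting point only.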


Your problem is that your regex quantifiers (the * and + characters) are greedy, meaning that they match as much as possible. You need reluctant quantifiers instead. The corrected pattern below adds just two characters: a ? after the * and after the +, so that each matches as little as possible.

String pattern = "([\\w \\W]*?)((http://)([\\w \\W]+?)(.com))";
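To see the difference side by side, here is a small sketch that runs both the original greedy pattern and the reluctant one over the text from the question. The class name `GreedyVsReluctantDemo` and the helper `urls` are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctantDemo {
    // Same text as in the question.
    static final String QUERY = "This is a URL http://facebook.com"
            + " and this is another, http://twitter.com "
            + "this is the last URL http://instagram.com"
            + " all these URLs should be printed after the code execution";

    // Collects group 2 (the URL part) from every match of the given regex.
    static List<String> urls(String regex) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(QUERY);
        while (m.find()) {
            found.add(m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        // Greedy: the first group swallows everything up to the last "http://",
        // so there is a single match. Prints [http://instagram.com]
        System.out.println(urls("([\\w \\W]*)((http://)([\\w \\W]+)(.com))"));

        // Reluctant: each group stops as early as it can, so find() succeeds three times.
        // Prints [http://facebook.com, http://twitter.com, http://instagram.com]
        System.out.println(urls("([\\w \\W]*?)((http://)([\\w \\W]+?)(.com))"));
    }
}
```

Note that the dot in (.com) is unescaped in both patterns, so it matches any character; writing it as \\. would be stricter.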
entpnerd

Perhaps you're looking for this regex:

http://(\w+(?:\.\w+)+)

For example, from this string:

http://ww1.amazon.com and http://npr.org

it extracts

"ww1.amazon.com"
"npr.org"

To break down how it works:

http://      is literal
( ... )      is the main capture group
\w+          finds one or more word characters (letters, digits, or underscore)
(?: ... )    ...followed by a non-capturing group
\.\w+        ...that contains a literal period followed by at least one word character
+            ...repeated one or more times
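In Java, the breakdown above could be used roughly as follows. This is only a sketch: the class name `HostExtractDemo` and the helper `hosts` are invented for the example, and every backslash is doubled because the regex sits inside a Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HostExtractDemo {
    // http://(\w+(?:\.\w+)+) written as a Java string literal.
    static final Pattern HOST = Pattern.compile("http://(\\w+(?:\\.\\w+)+)");

    // Returns capture group 1 (the host name without "http://") for every match.
    static List<String> hosts(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = HOST.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [ww1.amazon.com, npr.org]
        System.out.println(hosts("http://ww1.amazon.com and http://npr.org"));
    }
}
```

If you want the full URL including the scheme, use m.group() instead of m.group(1).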

Hope this helps.

fearless_fool

I hope this clears it up for you: you are matching too many characters. Your pattern should be as restrictive as possible, because regex quantifiers are greedy by default and will try to match as much as possible.

Here is my take on your code:

import java.util.regex.*;

public class Main {

    static String query = "This is a URL http://facebook.com"
            + " and this is another, http://twitter.com "
            + "this is the last URL http://instagram.com"
            + " all these URLs should be printed after the code execution";

    public static void main(String args[]) {
        // [Ww.]* allows an optional "www." prefix; the dot before "com" is escaped
        // so that it matches a literal period rather than any character
        String pattern = "(http:[/][/][Ww.]*[a-zA-Z]+\\.com)";
        Pattern p = Pattern.compile(pattern);
        Matcher m = p.matcher(query);

        while (m.find()) {
            System.out.println(m.group(1));
        }
    }
}

The above code will match only your examples; if you wish to match more, you need to tweak it to your needs.

A great way to test patterns live is http://www.regexpal.com/. You can tweak your pattern there to match exactly what you want; just remember to replace each \ with a double \\ in Java for escaped characters.
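As a rough illustration of the escaping point, the sketch below takes a pattern as you might type it into an online tester, http://\w+\.com, and writes it as a Java string literal. The pattern, the class name `EscapeDemo` and the sample inputs are assumptions made up for this example.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Tester form: http://\w+\.com
    // Java string literal form: every backslash is doubled.
    static final String ESCAPED_DOT = "http://\\w+\\.com";

    // Same pattern with the dot left unescaped, so it matches any character.
    static final String UNESCAPED_DOT = "http://\\w+.com";

    // True only if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches(ESCAPED_DOT, "http://example.com"));   // true
        System.out.println(matches(ESCAPED_DOT, "http://exampleXcom"));   // false
        System.out.println(matches(UNESCAPED_DOT, "http://exampleXcom")); // true
    }
}
```

The last line shows why an unescaped dot is risky: the pattern accepts text that is not a .com address at all.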

Mihai