3

I am having trouble porting a regular expression that I use in JavaScript to Java. A web page may contain:

<b>Renewal Date:</b> 03 May 2010</td>

I just want to pull out the 03 May 2010, bearing in mind that the page contains much more than the line above. This is how I currently do it in JavaScript:

DateStr = /<b>Renewal Date:<\/b>(.+?)<\/td>/.exec(returnedHTMLPage);

I tried to follow some tutorials on java.util.regex.Pattern and java.util.regex.Matcher with no luck. I can't work out how to translate (.+?) into something they understand.

thanks,

Noeneel

Bill the Lizard
bebeTech

4 Answers

4

This is how regular expressions are used in Java:

Pattern p = Pattern.compile("<b>Renewal Date:</b>(.+?)</td>");
Matcher m = p.matcher(returnedHTMLPage);

if (m.find()) // find the next match (and "generate the groups")
    System.out.println(m.group(1)); // prints whatever the .+? expression matched.

There are other useful methods in the Matcher class, such as m.matches(), which succeeds only if the entire input matches the pattern. Have a look at the Matcher documentation.
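As a minimal, self-contained sketch of the approach above: the class name, method name, and sample page fragment below are made up for illustration, and the pattern is the one from the question. The trim() call is an addition, since the captured group includes the space after the closing </b>.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; only Pattern and Matcher come from the JDK.
class RenewalDateExtractor {
    // Compile once and reuse; the regex is the one from the question.
    private static final Pattern RENEWAL_DATE =
            Pattern.compile("<b>Renewal Date:</b>(.+?)</td>");

    /** Returns the captured date with whitespace trimmed, or null if not found. */
    static String extractRenewalDate(String html) {
        Matcher m = RENEWAL_DATE.matcher(html);
        // find() looks for the pattern anywhere in the input.
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Assumed sample input standing in for returnedHTMLPage.
        String page = "<html><body><table><tr><td>"
                + "<b>Renewal Date:</b> 03 May 2010</td></tr></table></body></html>";
        System.out.println(extractRenewalDate(page)); // prints "03 May 2010"
    }
}
```

Returning null when nothing is found is just one option; an Optional<String> would work equally well.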

fospathi
aioobe
  • To get a partial match, you need to use `find()` rather than `matches()`. I've edited aioobe's answer to fix this. – Jan Goyvaerts May 04 '10 at 09:32
  • @aioobe, thank you once again. `find()` did the trick. Sorry for the newbie questions; I have done lots of self-taught JavaScript and am now trying to transition to Java. – bebeTech May 04 '10 at 11:35
4

On matches vs find

The problem is that you used matches when you should've used find. From the API:

  • The matches method attempts to match the entire input sequence against the pattern.
  • The find method scans the input sequence looking for the next subsequence that matches the pattern.

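Note that String.matches(String regex) also requires a full match of the entire string. Unfortunately String does not provide a partial regex match, but you can always use s.matches(".*pattern.*") instead. Keep in mind that . does not match line terminators by default, so for multi-line input such as an HTML page you would need to prefix the regex with (?s).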
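The difference can be seen in a small sketch. The class name, method names, and sample input are made up for illustration, and the input deliberately contains line breaks, as a real HTML page would.

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing matches() and find().
class MatchesVsFind {
    static final String REGEX = "<b>Renewal Date:</b>(.+?)</td>";
    static final Pattern DATE = Pattern.compile(REGEX);

    // True only if the whole input is one match of the pattern.
    static boolean wholeInputMatches(String s) {
        return DATE.matcher(s).matches();
    }

    // True if the pattern occurs anywhere in the input.
    static boolean containsMatch(String s) {
        return DATE.matcher(s).find();
    }

    // Emulates find() with String.matches by padding with .* on both sides;
    // (?s) lets the padding cross line breaks.
    static boolean containsMatchViaString(String s) {
        return s.matches("(?s).*" + REGEX + ".*");
    }

    public static void main(String[] args) {
        // Assumed sample input with line breaks around the table cell.
        String page = "<tr>\n<td><b>Renewal Date:</b> 03 May 2010</td>\n</tr>";
        System.out.println(wholeInputMatches(page));             // false
        System.out.println(containsMatch(page));                 // true
        System.out.println(page.matches(".*" + REGEX + ".*"));   // false: . stops at \n
        System.out.println(containsMatchViaString(page));        // true
    }
}
```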


On reluctant quantifier

Java understands (.+?) perfectly.

Here's a demonstration: you're given a string s that consists of a string t repeating at least twice. Find t.

System.out.println("hahahaha".replaceAll("^(.+)\\1+$", "($1)"));
// prints "(haha)" -- greedy takes longest possible

System.out.println("hahahaha".replaceAll("^(.+?)\\1+$", "($1)"));
// prints "(ha)" -- reluctant takes shortest possible

On escaping metacharacters

It should also be said that you have injected \ into your regex ("\\" as Java string literal) unnecessarily.

        String regexDate = "<b>Expiry Date:<\\/b>(.+?)<\\/td>";
                                            ^^         ^^
        Pattern p2 = Pattern.compile("<b>Expiry Date:<\\/b>");
                                                      ^^

\ is used to escape regex metacharacters. A / is NOT a regex metacharacter.
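A quick sketch of what this means in practice, with a made-up class and helper name: escaping / is harmless but redundant, whereas escaping a real metacharacter such as . changes what the regex matches.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; fullMatch is a thin wrapper around Pattern.matches.
class SlashEscapeDemo {
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Escaped and unescaped slash behave identically.
        System.out.println(fullMatch("</b>", "</b>"));    // true
        System.out.println(fullMatch("<\\/b>", "</b>"));  // true

        // A dot is a metacharacter, so escaping it does matter.
        System.out.println(fullMatch("a.c", "abc"));      // true: . matches any character
        System.out.println(fullMatch("a\\.c", "abc"));    // false: only a literal dot
        System.out.println(fullMatch("a\\.c", "a.c"));    // true
    }
}
```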


Community
polygenelubricants
  • I only used it because it was suggested previously that I needed it. If I have just: `String regexDate = "Expiry Date:(.+?)"; Pattern p = Pattern.compile(regexDate); Matcher m = p.matcher(returnedHTML); if (m.matches()) { System.out.println("*******REGEX RESULT*******"); System.out.println(m.group(1)); System.out.println("*******REGEX RESULT*******"); }` it still fails. – bebeTech May 04 '10 at 09:01
  • @bebeTech: use `if (m.find())` instead of `if (m.matches())` in this case. Look at the documentation to see difference. – polygenelubricants May 04 '10 at 09:13
  • @polygenelubricants: the problem is not the backslashes he introduced. They end up just quoting the / following them, so shouldn't mess with the results (although they are of course superfluous). – wds May 04 '10 at 10:13
  • @wds: Ah, you're right. `"/".matches("\\/")` is `true`. Answer restructured. – polygenelubricants May 04 '10 at 10:19
1

OK, so using aioobe's original suggestion (which I also tried earlier), I have:

String regexDate = "<b>Expiry Date:</b>(.+?)</td>";
Pattern p = Pattern.compile(regexDate);
Matcher m = p.matcher(returnedHTML);

if (m.matches()) // check if it matches (and "generate the groups")
{
  System.out.println("*******REGEX RESULT*******"); 
  System.out.println(m.group(1)); // prints whatever the .+? expression matched.
  System.out.println("*******REGEX RESULT*******"); 
}

The if condition must keep evaluating to false, because *******REGEX RESULT******* is never printed.

In case anyone missed what I am trying to achieve: I just want to extract the date. Somewhere in an HTML page there is a date like <b>Expiry Date:</b> 03 May 2010</td>, and I want the 03 May 2010.

bebeTech
  • Then change `if (m.matches())` to `if (m.find())`, as @polygenelubricants mentioned above! @Jan even kindly updated my post to use `find()` instead of `matches()`. – aioobe May 04 '10 at 09:41
  • Yes, 2 people with the correct answer but I can only tick one? – bebeTech May 04 '10 at 11:37
0

(.+?) is an odd choice. Try ( *[0-9]+ *[A-Za-z]+ *[0-9]+ *) or just ([^<]+) instead.
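A sketch of the ([^<]+) variant, using a made-up class and method name and the Renewal Date label from the question. One practical difference, shown with an assumed input where the date sits on its own line: [^<] matches line breaks, while . does not by default.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the negated-character-class alternative.
class NegatedClassDemo {
    // Captures everything up to the next '<', including line breaks.
    private static final Pattern DATE =
            Pattern.compile("<b>Renewal Date:</b>([^<]+)</td>");

    /** Returns the trimmed date, or null if the pattern is not found. */
    static String extract(String html) {
        Matcher m = DATE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Assumed input where the date is wrapped onto its own line.
        String page = "<td><b>Renewal Date:</b>\n 03 May 2010\n</td>";
        System.out.println(extract(page)); // prints "03 May 2010"

        // The original (.+?) finds nothing here, because . stops at line breaks.
        Matcher lazy = Pattern.compile("<b>Renewal Date:</b>(.+?)</td>").matcher(page);
        System.out.println(lazy.find()); // prints "false"
    }
}
```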

drawnonward
  • It works and validates as OK syntax to use. I just can't get it to work with Java. I have it working fine with JS. – bebeTech May 04 '10 at 09:13