I was trying to match the following example: `<p><a href="example/index.html">LinkToPage</a></p>`

With rubular.com I could get something like `<a href=\"(.*)?\/index.html\">.*<\/a>`.

I'll be using this in `Pattern.compile` in Java. I know that `\` has to be escaped as well, and I've come up with `<a href=\\\"(.*)?\\\/index.html\\\">.*<\\\/a>` and a few more variations, but I'm still getting it wrong. I tested on regexplanet. Can anyone help me with this?

Crocode
  • No escaping necessary here. Just use `...=\"...`. A backslash only needs to be escaped when you actually want a literal backslash, and in a regex you have to escape it twice. – jlordo Jun 03 '13 at 19:34
  • If it's an escaping issue, try printing your string to the command line to figure out what it thinks it is and correcting accordingly. All those backslashes can get annoying. – chessbot Jun 03 '13 at 19:35
  • Eclipse indicates invalid escape sequence for `.*<\/a>`. – Crocode Jun 03 '13 at 19:38
  • Since this is HTML, you should consider using an HTML parser, for instance jsoup. – fge Jun 03 '13 at 19:38
  • You don't need `\` before `/`. – Pshemo Jun 03 '13 at 19:38
  • Replace `...\/...` with `.../...`. – jlordo Jun 03 '13 at 19:38
  • @fge It is a list of 700-800 href. So I thought this would be simple – Crocode Jun 03 '13 at 19:40
  • @Crocode it is not the question of size here; it is the question that when using regexes, you can easily match false positives; what if your pattern matches some text in a `
    ` block? You have close to zero chance to write a regex eliminating all false positives... And this is why parsers exist.
    – fge Jun 03 '13 at 19:49

3 Answers

Use `"<a href=\"(.*)/index.html\">.*</a>"` in your Java code.

You only need to escape `"` because it's a Java string literal.

You don't need to escape `/`, because you aren't delimiting your regex with slashes (as you would be in Ruby).

Also, `(.*)?` makes no sense. Just use `(.*)`. `*` can already match "nothing", so there's no point in having the `?`.
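
As a minimal sketch of the above, the pattern can be compiled and applied to the line from the question like this. The class name `HrefMatcher` and the helper `extractDir` are made up for illustration and are not part of the original question; the sketch also assumes one link per input line, since `.*` is greedy.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefMatcher {
    // Only the double quotes are escaped, because this is a Java string literal.
    // The forward slashes need no escaping in a Java regex.
    private static final Pattern LINK =
            Pattern.compile("<a href=\"(.*)/index.html\">.*</a>");

    // Hypothetical helper: returns the text captured before "/index.html",
    // or null when the line contains no matching link.
    static String extractDir(String line) {
        Matcher m = LINK.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "<p><a href=\"example/index.html\">LinkToPage</a></p>";
        System.out.println(extractDir(line)); // prints "example"
    }
}
```

Note that the `.` in `index.html` is still a regex metacharacter here, so it matches any character; write `index\\.html` in the Java literal if it should match only a literal dot.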

Laurence Gonsalves
Pattern.compile("<a href=\"(.*)?/index.html\">.*</a>");

That should fix your regex. You do not need to escape the forward slashes.

However, I am obligated to present you with the standard caution against parsing HTML with regex:

RegEx match open tags except XHTML self-contained tags

Aurand

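You can pass the literal text you want to match to `Pattern.quote(str)`, which returns a pattern that matches that text literally, so you don't have to escape regex metacharacters yourself. Quote only the literal parts, though: anything inside the quoted text, such as `(.*)`, loses its special meaning. The `"` characters still need `\"` in the Java source, because that is string-literal escaping rather than regex escaping.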
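
A rough sketch of how that might look for the link from the question, assuming the fixed parts of the tag are quoted and the wildcards are left as plain regex. The class name `QuotedHrefMatcher` and the helper `extractDir` are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedHrefMatcher {
    // Literal fragments go through Pattern.quote; the capture group and the
    // wildcard stay outside the quoted parts so they keep their regex meaning.
    static final Pattern LINK = Pattern.compile(
            Pattern.quote("<a href=\"")
            + "(.*)"
            + Pattern.quote("/index.html\">")
            + ".*"
            + Pattern.quote("</a>"));

    // Hypothetical helper: returns the text captured before "/index.html",
    // or null when the line contains no matching link.
    static String extractDir(String line) {
        Matcher m = LINK.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "<p><a href=\"example/index.html\">LinkToPage</a></p>";
        System.out.println(extractDir(line)); // prints "example"
    }
}
```

Because `/index.html">` is quoted, the dot in it matches only a literal dot, which is slightly stricter than the hand-written pattern.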

John Humphreys