
Newest Update: This seems to be a problem with the matcher, not the expression itself. After more testing, using Pattern/Matcher on the input string is what triggers it: when the input string contains metacharacters, the matcher skips over a match. A plain .replaceAll with the same expression finds it just fine. I tried Pattern.quote on the input string, but that didn't change anything, so I'm still stuck. Why does the matcher fail to find a match when the input string contains metacharacters? And is there a way to make the matcher ignore metacharacters in the input string?


I am trying to run a regex over a large string to pull out all HTML links, from the start of the opening tag to the closing tag. I came up with this expression:

<a.*?</a>

Which does a pretty good job. It gets almost all of them. My problem is when there are parentheses inside the string, like:

<a href="blahblah">myproblem()</a>

The matcher completely skips this link. I thought that the .*? would pick up everything from the space after the first a to the opening bracket of the closing a tag, but it doesn't if there are any parentheses.

What am I missing here?

EDIT for clarification:

I am using Java. Here is what I am doing to test this before adding it to my project. When I run this it fails, but if I take out the () in test, it passes. With the (), I'm pretty sure the link isn't even being added to the list:

String tryConvert = doclet.htmlToWiki("<a href=\"#test.method\">test()</a>");
assertThat(tryConvert, is("[test()|test#method]"));

And the htmlToWiki code:

    ArrayList<String> links = new ArrayList<String>();
    Pattern linkPattern = Pattern.compile("<a.*?</a>", Pattern.DOTALL);
    Matcher matcher = linkPattern.matcher(html);
    while (matcher.find())
    {
        links.add(matcher.group());
    }

    for (String link : links)
    {
        String original = link;
        String alias = link.replaceAll("<a.*?>", "");
        alias = alias.replaceAll("</a>", "");
        link = link.replaceAll("\">.*?</a>", "]");
        link = link.replaceAll("<a.*#", "[");
        link = link.replaceAll("\\.", "#");
        link = link.replace("[", "[" + alias + "|");
        html = html.replaceAll(original, link);
    }
Mimerr
  • What is the `?` supposed to do exactly? Oh, and this expression also picks up elements whose tag name starts with an "a", such as ``, ``, `` and so on. Also, [this](http://stackoverflow.com/a/1732454/1016716). – Mr Lister Jul 12 '13 at 17:08
  • I'm new to regex, ? was explained to me as 0 or 1 of the previous expression, I just put it there because I had seen .*? as a kind of 'catch all'. I didn't really think about those other tags, so thanks, but for now I'm just trying to understand why the () is messing things up. – Mimerr Jul 12 '13 at 17:43

2 Answers


Without seeing the JavaScript you're using it's hard to tell exactly what's wrong. Perhaps there are too many escape characters (which really aren't needed here anyway). This works for me:

var input = 'foo <a href="blahblah">myproblem()</a> bar';
var match = input.match(/<a.*?<\/a>/);
alert(match[0]); // <a href="blahblah">myproblem()</a>

Alternatively:

var input = 'foo <a href="blahblah">myproblem()</a> bar';
var match = RegExp('<a.*?</a>').exec(input);
alert(match[0]); // <a href="blahblah">myproblem()</a>
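
Since the question turned out to be about Java, here is a rough Java counterpart of the check above. It is only a sketch and not part of the original answer; the class name LinkFinder and the method findLinks are made up for illustration. It reuses the pattern and the Pattern.DOTALL flag from the question and suggests that the Pattern/Matcher step on its own does pick up a link containing ():

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class LinkFinder {
    // Collects every <a ...>...</a> span using the pattern from the question.
    static List<String> findLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = Pattern.compile("<a.*?</a>", Pattern.DOTALL).matcher(html);
        while (matcher.find()) {
            links.add(matcher.group());
        }
        return links;
    }

    public static void main(String[] args) {
        // prints [<a href="blahblah">myproblem()</a>]
        System.out.println(findLinks("foo <a href=\"blahblah\">myproblem()</a> bar"));
    }
}
```

Parentheses in the input string are ordinary characters to the matcher; only the pattern string is parsed for metacharacters.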
p.s.w.g
  • Thanks for the quick answer. I will add more detail in the original post, maybe will help clarify some things. – Mimerr Jul 12 '13 at 17:17
  • I removed the escape characters for the < and > and everything still works the same way(..not sure why I thought they needed them.) So my expression should be working... I really have no clue why it isn't. It worked on http://www.regexplanet.com/advanced/java/index.html too.. – Mimerr Jul 12 '13 at 20:54
  • @user2395495 are you positive that it's not matching? Could the error be elsewhere in `htmlToWiki`? – p.s.w.g Jul 12 '13 at 20:56
  • Yeah, I cut it down to just that section and tried again. When the string between the anchor tags has () it doesn't even add to the list. – Mimerr Jul 12 '13 at 20:59

After a lot of testing I figured out that my pattern and matcher weren't the problem after all. The problem was in the last replaceAll call: its first argument, original, is interpreted as another regex pattern, not as a literal. In that pattern, () is parsed as an empty group rather than as two literal characters, so it never matched the link text and nothing was replaced.

If you are trying something similar, wrap your original variable in Pattern.quote() when you do the final replaceAll:

Pattern.quote(original)

This makes replaceAll treat original as a literal string.
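
To make the difference concrete, here is a minimal sketch that reduces the final step of htmlToWiki to a single replaceAll call. The class name QuoteDemo and both helper methods are hypothetical names used only for this illustration; the strings come from the test in the question.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class QuoteDemo {
    // Passes `original` straight to replaceAll, so it is parsed as a regex.
    static String replaceAsRegex(String html, String original, String replacement) {
        return html.replaceAll(original, replacement);
    }

    // Quotes `original` first, so every character is matched literally.
    static String replaceQuoted(String html, String original, String replacement) {
        return html.replaceAll(Pattern.quote(original), replacement);
    }

    public static void main(String[] args) {
        String html = "see <a href=\"#test.method\">test()</a> here";
        String original = "<a href=\"#test.method\">test()</a>";
        String wiki = "[test()|test#method]";

        // "()" is an empty group here, so the pattern looks for "test</a>"
        // and the input comes back unchanged.
        System.out.println(replaceAsRegex(html, original, wiki));

        // prints: see [test()|test#method] here
        System.out.println(replaceQuoted(html, original, wiki));
    }
}
```

If no regex behaviour is needed at all, String.replace(original, link) is probably the simpler option, since it always treats both arguments literally and still replaces every occurrence. Note also that the replacement argument of replaceAll gives special meaning to $ and \, so a replacement that may contain those characters would need Matcher.quoteReplacement.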

Thanks for the help, everyone. I guess my question was misleading because I didn't realize such a small thing (isn't that always the case!?).

Mimerr