
Match pattern foo, but not if it occurs after pattern bar. Basically, given a string, I am trying to match an opening tag <any string>, and the match should not occur if it comes after a closing tag </any string>.

Note: I am trying an approach like this, which might not be the right path to the solution. I would be happy if you could help with the current issue.

So it should match:
<h1> in <h1>
<h1> in <h1> abc </h1>
<abc> in <abc>something</cde><efg>
<abc> in something<abc>something

Should not match anything in:
</h1>
</abc> one two three <abc> five six <abc>
one two three </abc> five six <abc>

sql_dummy
  • Could you explain a little more? You basically want it to match the first occurrence of an opening tag so long as there is no closing tag anywhere before it? What is the ultimate goal for this regex? Knowing that will help me understand what you want. – xtratic Mar 28 '18 at 14:26
  • Yes, I am trying to parse an `HTML` string. As I said, what I am trying to do might not be the solution, but I also want to learn regex through this issue. – sql_dummy Mar 28 '18 at 14:32
  • *`<`any thing`>`* wouldn't match a closing tag if *any thing* doesn't start with a slash. – revo Mar 28 '18 at 14:33
  • https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags See this "post" for the best answer :) – dognose Mar 28 '18 at 14:51

1 Answer


The easiest solution is to hand some of the work to the Java regex API. The regex itself only has to match <[^>]*>, i.e. any HTML tag. Then Matcher.region() limits the search to the part of the string that comes before the first </, so neither the closing tag nor anything after it can match.

Here is the code:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class OpeningTags {
        public static void main(String[] args) {
            // example data
            String[] inputLines = {
                    "<h1>",
                    "<h1> abc </h1>",
                    "<abc>something</cde><efg>",
                    "something<abc>something",
                    "",
                    "</h1>",
                    "</abc> one two three <abc> five six <abc>",
                    "one two three </abc> five six <abc>"
            };

            // the pattern for any HTML tag
            Pattern pattern = Pattern.compile("<[^>]*>");

            for (String line : inputLines) {
                Matcher matcher = pattern.matcher(line);
                // index of the first closing tag; nothing from here on may match
                int undesiredStart = line.indexOf("</");

                // if there is no closing tag (indexOf returned -1),
                // the region must extend to the end of the string
                matcher.region(0, undesiredStart == -1 ? line.length() : undesiredStart);

                // this is the idiom to iterate through the matches
                while (matcher.find()) {
                    System.out.println(matcher.group());
                }
            }
        }
    }
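
Since the question also asks about doing this with regex alone, here is a sketch of a regex-only variant. It is not the approach above, just an alternative under stated assumptions: the class name `OpeningTagsBeforeClose` and the method `openingTags` are made up for illustration, and `Pattern.DOTALL` is assumed so that the skipped text may contain line breaks. The idea is to chain matches with `\G` so that each one must start where the previous one ended, and to forbid the skipped text from ever stepping over a `</`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OpeningTagsBeforeClose {
    // \G             : start at the end of the previous match (or at the start of the input)
    // (?:(?!</).)*?  : lazily skip characters, but never step onto a "</"
    // (<(?!/)[^>]*>) : capture a tag that does not begin with "</"
    private static final Pattern OPENING_TAG =
            Pattern.compile("\\G(?:(?!</).)*?(<(?!/)[^>]*>)", Pattern.DOTALL);

    // Hypothetical helper: returns every tag that appears before the first "</".
    static List<String> openingTags(String line) {
        List<String> tags = new ArrayList<>();
        Matcher matcher = OPENING_TAG.matcher(line);
        while (matcher.find()) {
            tags.add(matcher.group(1));
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(openingTags("<abc>something</cde><efg>"));           // [<abc>]
        System.out.println(openingTags("one two three </abc> five six <abc>")); // []
    }
}
```

Once the chain reaches a `</`, the next attempt cannot start at the position `\G` requires, so `find()` returns false and everything after the closing tag is ignored. The `region()` version is probably easier to read; this one mainly shows how `\G` and a negative lookahead can express "not after".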
Tamas Rev