How to find all possible regex matches

Question

Is there a way to get all possible matches using a regex in Java?

Java tester program:

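    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.commons.io.IOUtils;

    public class NewClass {
        public static void main(String[] args) throws IOException {
            // Read the whole file into one string (IOUtils comes from Apache Commons IO).
            String targetFileStr = IOUtils.toString(new FileInputStream(new File("src/SampleHTML.html")), "UTF-8");
            Matcher matcher = Pattern.compile("<body>(.|[\\r\\n])*?<link").matcher(targetFileStr);
            while (matcher.find()) {
                System.out.println(matcher.group());
            }
        }
    }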

Sample HTML file:

<!DOCTYPE html>
<html>
  <body>
        <script src="1"></script>
        <link href="1" />
        <link href="2" />
        <link href="3" />
        <div>TODO write content</div>
    </body>
</html>

Non-greedy regex, current output: with the non-greedy regex "<body>(.|[\\r\\n])*?<link" the program prints

<body>
        <script src="1"></script>
       <link

Greedy regex, current output: with the greedy regex "<body>(.|[\\r\\n])*<link" the program prints

<body>
            <script src="1"></script>
            <link href="1" />
            <link href="2" />
            <link
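To make the difference between the two quantifiers concrete, here is a minimal self-contained sketch. It assumes an inline copy of the sample markup (the HTML constant and the firstMatch helper are illustrative names, not part of the original program) and uses Pattern.DOTALL in place of the (.|[\\r\\n]) alternation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {
    // Inline stand-in for the sample HTML file, so the sketch needs no file I/O.
    static final String HTML = String.join("\n",
            "<body>",
            "<script src=\"1\"></script>",
            "<link href=\"1\" />",
            "<link href=\"2\" />",
            "<link href=\"3\" />",
            "</body>");

    // Returns the first match of the given regex in HTML, or null if there is none.
    static String firstMatch(String regex) {
        // DOTALL lets '.' match line terminators as well.
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(HTML);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Reluctant quantifier: stops at the first "<link" after "<body>".
        System.out.println(firstMatch("<body>.*?<link"));
        System.out.println("---------");
        // Greedy quantifier: runs on to the last "<link" in the input.
        System.out.println(firstMatch("<body>.*<link"));
    }
}
```

Either way, find() reports at most one match per starting position, which is why neither variant yields the intermediate results.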

Expected output: I need to get every possible match from <body> to <link:

  1:   <body>
            <script src="1"></script>
           <link

  2:   <body>
            <script src="1"></script>
            <link href="1" />
            <link

  3:   <body>
            <script src="1"></script>
            <link href="1" />
            <link href="2" />
            <link

Why this question: I am creating a tool that finds and highlights all external style sheets in the body.

I highly recommend using an HTML parser instead of regexes. — Maroun, Nov 24 '13 at 07:45
Most regex engines don't work that way. What about searching for ` — p.s.w.g, Nov 24 '13 at 07:46
Do not use regex to parse HTML. Please read [this answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) — Jim Garrison, Nov 24 '13 at 07:48
Is there a Java HTML parser that also gives the line and column number of a match? — hiddenuser, Nov 24 '13 at 07:54

Pshemo · Accepted Answer · 2013-11-24T07:51:56.307

The correct approach would be to use an HTML parser instead of regex. This answer only shows a regex mechanism that could help in similar cases that do not involve HTML or other data that already has its own parser.

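You could use the look-behind mechanism to find every <link that has <body>.* before it, and capture that "prefix" in a group. Unfortunately, in Java the content of a look-behind must have a bounded maximum length, so .{0,1000} stands in for .* here; a <link that sits more than 1000 characters after <body> will not be found. So you can try something like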

String targetFileStr = IOUtils.toString(new FileInputStream(new File(
        "input.txt")), "UTF-8");
Matcher matcher = Pattern.compile("(?<=(<body>.{0,1000}))<link",
        Pattern.DOTALL).matcher(targetFileStr);
while (matcher.find()) {
    System.out.println(matcher.group(1) + matcher.group());
    System.out.println("---------");
}

Output:

<body>
        <script src="1"></script>
        <link
---------
<body>
        <script src="1"></script>
        <link href="1" />
        <link
---------
<body>
        <script src="1"></script>
        <link href="1" />
        <link href="2" />
        <link
---------
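If the 1000-character bound is a concern, the same three results can be produced without look-behind. The following is only a sketch of a different technique, not the approach above: it locates <body> once with indexOf and then takes the substring up to each later <link. The class name, the bodyToEachLink helper, and the inline sample string are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AllPrefixMatches {
    // Collects every substring that starts at the first "<body>"
    // and ends with one of the "<link" occurrences that follow it.
    static List<String> bodyToEachLink(String html) {
        List<String> results = new ArrayList<>();
        int bodyStart = html.indexOf("<body>");
        if (bodyStart < 0) {
            return results; // no <body>, so nothing to report
        }
        Matcher link = Pattern.compile("<link").matcher(html);
        // Restrict the search to the text from <body> onwards;
        // start()/end() still refer to positions in the whole input.
        link.region(bodyStart, html.length());
        while (link.find()) {
            results.add(html.substring(bodyStart, link.end()));
        }
        return results;
    }

    public static void main(String[] args) {
        String html = String.join("\n",
                "<html>",
                "<body>",
                "<script src=\"1\"></script>",
                "<link href=\"1\" />",
                "<link href=\"2\" />",
                "<link href=\"3\" />",
                "</body>",
                "</html>");
        for (String match : bodyToEachLink(html)) {
            System.out.println(match);
            System.out.println("---------");
        }
    }
}
```

Because link.start() and link.end() are offsets into the original string, they could also be used to locate each match for highlighting. This remains a text-level sketch and does not replace an HTML parser.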

edited Nov 24 '13 at 07:51

answered Nov 24 '13 at 07:45

Pshemo


@downvoter would you be so kind as to tell me which part of my answer is incorrect? – Pshemo Nov 24 '13 at 07:57
+1 for the "limited variable length" lookbehind. – Casimir et Hippolyte Nov 24 '13 at 09:07
Thanks. Could you explain (?<=(.{0,1000}))? – hiddenuser Nov 24 '13 at 09:57
@user2998596 `(?<=...)` is [positive look behind](http://www.regular-expressions.info/lookaround.html#lookbehind). Combined with `(?<=..) – Pshemo Nov 24 '13 at 10:11
@user2998596 but seriously, combining regex with HTML is a very bad idea. You really should try to solve it with an HTML or XML parser. – Pshemo Nov 24 '13 at 10:12
