Is there a way to get all possible matches using a regex in Java?

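Java test program:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.commons.io.IOUtils; // Apache Commons IO

    public class NewClass {
        public static void main(String[] args) throws IOException {
            // Read the whole HTML file into one string
            String targetFileStr = IOUtils.toString(new FileInputStream(new File("src/SampleHTML.html")), "UTF-8");
            Matcher matcher = Pattern.compile("<body>(.|[\\r\\n])*?<link").matcher(targetFileStr);
            while (matcher.find()) {
                System.out.println(matcher.group());
            }
        }
    }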

Sample HTML file:

<!DOCTYPE html>
<html>
  <body>
        <script src="1"></script>
        <link href="1" />
        <link href="2" />
        <link href="3" />
        <div>TODO write content</div>
    </body>
</html>

Non-greedy regex, current output: with the non-greedy regex "<body>(.|[\\r\\n])*?<link" the program prints:

<body>
        <script src="1"></script>
       <link

Greedy regex, current output: with the greedy regex "<body>(.|[\\r\\n])*<link" the program prints:

<body>
            <script src="1"></script>
            <link href="1" />
            <link href="2" />
            <link

Expected output: I need to get every possible match from <body> to <link:

  1:   <body>
            <script src="1"></script>
           <link

  2:   <body>
            <script src="1"></script>
            <link href="1" />
            <link

  3:   <body>
            <script src="1"></script>
            <link href="1" />
            <link href="2" />
            <link

Why this question: I am creating a tool that will find and highlight all external style sheets in the body.

hiddenuser

1 Answer

The correct approach would be to use an HTML parser instead of a regex. This answer only shows a regex mechanism that could help in similar cases that do not involve HTML or any other data that already has its own parser.
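To give an idea of what the parser route looks like, here is a minimal sketch that uses only the JDK's built-in XML parser. It rests on a big assumption: the page must be well-formed XML, which happens to be true for the sample file in the question but is not true for HTML in general. For real-world pages a dedicated HTML parser (jsoup, for example) is the safer choice. The class name BodyLinkFinder and the method linkHrefsInBody are made up for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class BodyLinkFinder {

    // Returns the href attribute of every <link> element nested inside <body>.
    // Assumption: the markup is well-formed XML; otherwise parse() throws.
    public static List<String> linkHrefsInBody(String markup) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));

        List<String> hrefs = new ArrayList<>();
        Element body = (Element) doc.getElementsByTagName("body").item(0);
        if (body == null) {
            return hrefs; // no <body>, so nothing to report
        }
        // getElementsByTagName on an element searches all of its descendants
        NodeList links = body.getElementsByTagName("link");
        for (int i = 0; i < links.getLength(); i++) {
            hrefs.add(((Element) links.item(i)).getAttribute("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) throws Exception {
        // Same content as the sample file from the question
        String markup = String.join("\n",
                "<!DOCTYPE html>",
                "<html>",
                "  <body>",
                "    <script src=\"1\"></script>",
                "    <link href=\"1\" />",
                "    <link href=\"2\" />",
                "    <link href=\"3\" />",
                "    <div>TODO write content</div>",
                "  </body>",
                "</html>");
        System.out.println(linkHrefsInBody(markup));
    }
}
```

This returns the link elements themselves rather than text spans, which is usually what a highlighting tool needs.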


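You could use the look-behind mechanism to find every <link that has <body>.* before it, and capture that "prefix" in a group. Unfortunately, in Java the content of a look-behind must have a bounded maximum length, so the unbounded .* has to be replaced with something like .{0,1000}. That also means a <link more than 1000 characters after <body> will not be found. So you can try something like: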

String targetFileStr = IOUtils.toString(new FileInputStream(new File(
        "input.txt")), "UTF-8");
Matcher matcher = Pattern.compile("(?<=(<body>.{0,1000}))<link",
        Pattern.DOTALL).matcher(targetFileStr);
while (matcher.find()) {
    System.out.println(matcher.group(1) + matcher.group());
    System.out.println("---------");
}

Output:

<body>
        <script src="1"></script>
        <link
---------
<body>
        <script src="1"></script>
        <link href="1" />
        <link
---------
<body>
        <script src="1"></script>
        <link href="1" />
        <link href="2" />
        <link
---------
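If the 1000-character cap in the look-behind is a concern, the same prefixes can be collected without any look-behind: locate <body> once, then let the matcher find each <link after it and cut the substring from the start of <body> to the end of that match. The sketch below is only an illustration under that idea; the class name BodyToLinkPrefixes and the method prefixes are hypothetical, and it uses an inline string instead of reading a file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BodyToLinkPrefixes {

    // Returns one string per "<link" found after the first "<body>":
    // each string runs from "<body>" up to and including that "<link".
    public static List<String> prefixes(String text) {
        List<String> result = new ArrayList<>();
        int bodyStart = text.indexOf("<body>");
        if (bodyStart < 0) {
            return result; // no <body>, so there are no prefixes
        }
        Matcher m = Pattern.compile("<link").matcher(text);
        // Restrict the search to the text that follows the <body> tag.
        // Match indices remain relative to the whole input string.
        m.region(bodyStart + "<body>".length(), text.length());
        while (m.find()) {
            result.add(text.substring(bodyStart, m.end()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Same content as the sample file from the question
        String html = String.join("\n",
                "<!DOCTYPE html>",
                "<html>",
                "  <body>",
                "        <script src=\"1\"></script>",
                "        <link href=\"1\" />",
                "        <link href=\"2\" />",
                "        <link href=\"3\" />",
                "        <div>TODO write content</div>",
                "    </body>",
                "</html>");
        for (String prefix : prefixes(html)) {
            System.out.println(prefix);
            System.out.println("---------");
        }
    }
}
```

Like the regex above, this works on raw text, so it would also pick up a <link that sits inside a comment or a script string.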
Pshemo