I'm trying to detect <code>...</code> chunks inside an HTML source file in order to remove them from the file.
I am using the Java 8 Pattern and Matcher classes for the regular expressions. For example, this method prints every <code>...</code> match it finds:
protected void printSourceCodeChunks() {
    // Design a regular expression to detect code chunks
    String patternString = "<code>.*<\\/code>";
    Pattern pattern = Pattern.compile(patternString);
    Matcher matcher = pattern.matcher(source);
    // Loop over findings
    int i = 1;
    while (matcher.find()) {
        System.out.println(i++ + ": " + matcher.group());
    }
}
A typical output would be:
1: <code> </code>
2: <code></code>
3: <code>System.out.println("Hello World");</code>
Because the dot does not match line terminators by default and the code chunks may contain line breaks (\n or \r), blocks that span several lines are not detected. Fortunately, the Pattern class can be instructed to let the dot match line terminators as well, by passing the DOTALL flag:
Pattern pattern = Pattern.compile(patternString, Pattern.DOTALL);
The problem with this approach is that only one (fake) <code>...</code> block is detected: the one that starts at the first occurrence of <code> and ends at the last occurrence of </code> in the HTML file. The output now includes all the HTML code between these two tags.
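The behavior described above can be reproduced with a small sketch. The class name GreedyDotAllDemo, the helper findAll and the sample input are hypothetical and only serve the illustration; the pattern is the greedy one from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDotAllDemo {

    // Collects every match of the given regex, compiled with the given flags
    static List<String> findAll(String regex, int flags, String input) {
        Matcher matcher = Pattern.compile(regex, flags).matcher(input);
        List<String> matches = new ArrayList<>();
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical input: two code blocks, the second one spans two lines
        String source = "<p>a</p><code>x</code><p>b</p><code>y\nz</code>";
        String greedy = "<code>.*</code>";

        // Without DOTALL the dot stops at the line break,
        // so only the single-line block is found
        System.out.println(findAll(greedy, 0, source));

        // With DOTALL the greedy .* runs from the first <code>
        // to the last </code>, giving one oversized match
        System.out.println(findAll(greedy, Pattern.DOTALL, source));
    }
}
```

In both calls the list contains exactly one element, but with DOTALL that element swallows the paragraph between the two blocks.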
How can I change the regular expression so that it matches every single code block?
Solution proposal
As many of you posted, and for the benefit of future readers, it was as easy as changing my regex to
<code>.*?<\\/code>
because a plain * is greedy and consumes everything up to the last </code> it finds, whereas *? is reluctant and stops at the first one.
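For completeness, here is a minimal sketch of the reluctant pattern combined with DOTALL, including the removal step that was the original goal. The class name ReluctantCodeChunks, its two helper methods and the sample input are hypothetical. It also assumes that the <code> tags carry no attributes and are not nested; for anything more complex an HTML parser is probably the safer choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReluctantCodeChunks {

    // .*? is reluctant: it stops at the first </code> it can reach.
    // DOTALL lets the dot match line breaks inside a block.
    static final Pattern CODE_CHUNK =
            Pattern.compile("<code>.*?</code>", Pattern.DOTALL);

    // Returns every <code>...</code> block found in the source
    static List<String> findChunks(String source) {
        Matcher matcher = CODE_CHUNK.matcher(source);
        List<String> chunks = new ArrayList<>();
        while (matcher.find()) {
            chunks.add(matcher.group());
        }
        return chunks;
    }

    // Returns the source with every <code>...</code> block removed
    static String removeChunks(String source) {
        return CODE_CHUNK.matcher(source).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical input: two code blocks, the second one spans two lines
        String source = "<p>a</p><code>x</code><p>b</p><code>y\nz</code>";

        System.out.println(findChunks(source).size()); // prints 2
        System.out.println(removeChunks(source));      // prints <p>a</p><p>b</p>
    }
}
```

Compiling the pattern once as a constant avoids recompiling it on every call, which matters if many files are processed.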