
I'm trying to detect <code>...</code> chunks inside an HTML source file in order to remove them from the file. I am using the Java 8 Pattern and Matcher classes for the regular expressions. For example, this method prints every <code>...</code> match it finds.

protected void printSourceCodeChunks() {
  // Design a regular expression to detect code chunks
  String patternString = "<code>.*<\\/code>";
  Pattern pattern = Pattern.compile(patternString);
  Matcher matcher = pattern.matcher(source);
  
  // Loop over findings
  int i = 1;
  while (matcher.find())
    System.out.println(i++ + ": " + matcher.group());
}

A typical output would be:

1: <code> </code>
2: <code></code>
3: <code>System.out.println("Hello World");</code>

Because the dot does not match line terminators by default and the source code chunks may include line breaks (\n or \r), no code block that contains a line break is detected. Fortunately, the Pattern class can be instructed to let the dot match line breaks as well, just by adding the Pattern.DOTALL flag:

  Pattern pattern = Pattern.compile(patternString, Pattern.DOTALL);

The problem with this approach is that only one (fake) <code>...</code> block is detected: the one starting at the first occurrence of <code> and ending at the last occurrence of </code> in the HTML file. The output now includes all the HTML code between these two tags.

How can I alter the regular expression to match every single code block?

Solution proposal

As many of you posted, and for the benefit of future readers, it was as easy as changing my regex to

<code>.*?<\\/code>

because a plain * is greedy and takes all characters up to the last </code> it finds, whereas *? stops at the first one.
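
Since my actual goal was to remove the chunks, here is a minimal sketch of how the reluctant pattern could be combined with Pattern.DOTALL and Matcher.replaceAll. The class name CodeBlockStripper and the method stripCodeBlocks are hypothetical names chosen for this illustration, and the sketch assumes the tags carry no attributes and are never nested.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class CodeBlockStripper {

    // DOTALL lets '.' match line terminators; '*?' is the reluctant quantifier,
    // so each match ends at the nearest </code> instead of the last one.
    private static final Pattern CODE_BLOCK =
            Pattern.compile("<code>.*?</code>", Pattern.DOTALL);

    // Returns the source with every <code>...</code> chunk removed.
    static String stripCodeBlocks(String source) {
        return CODE_BLOCK.matcher(source).replaceAll("");
    }

    public static void main(String[] args) {
        String source = "<p>a</p><code>int x;\nint y;</code><p>b</p><code></code>";
        System.out.println(stripCodeBlocks(source)); // prints <p>a</p><p>b</p>
    }
}
```

The escaped slash in my original pattern is not needed in Java, so it is omitted here.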

coterobarros
  • Be kind to yourself and use html parser – ne1410s Jan 31 '19 at 12:17
  • Don't parse HTML with RegExp: https://stackoverflow.com/a/1732454/345027 – king_nak Jan 31 '19 at 12:18
  • Make the match all expression reluctant, i.e. `.*?` which will make it match as little as possible. However, please be aware that code (Java, Html etc.) is an irregular problem domain and regex are generally no good fit for that. – Thomas Jan 31 '19 at 12:18
  • Aside from @king_nak link it may be worth reading [Can you provide some examples of why it is hard to parse XML and HTML with a regex?](https://stackoverflow.com/q/701166), [Using regular expressions to parse HTML: why not?](https://stackoverflow.com/q/590747) – Pshemo Jan 31 '19 at 12:19
  • Thank you Thomas, .*? works fine now. In fact my source is not HTML but an XML dialect that includes some HTML tags and some non-HTML tags. It is worth using regex for this special case, but I've learned from your comment that it is not a good solution for the general HTML case. – coterobarros Jan 31 '19 at 12:23

2 Answers


You don't use regex to manipulate html!

Instead, parse the html, for example with jsoup, and remove the elements properly.

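import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

String html = "<html><head><title>First parse</title></head>"
        + "<body><p>Parsed HTML into a doc.</p><code>foo</code><code></code><code> </code></body></html>";
Document doc = Jsoup.parse(html);
// Select every <code> element in the body and detach it from the document
Elements codes = doc.body().getElementsByTag("code");
codes.remove();
System.out.println(doc.toString());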
baao
  • Thank you @bambam. I use Jsoup elsewhere. I agree that Jsoup is the best solution for HTML and XML tags. I was using regular expressions here because my general case mixes HTML tags with some extra non-XML markup, namely *Markdown* markup. Generally speaking, my source is SGML compliant but not XHTML compliant. In fact, the code I was trying to fix is part of a validator/compiler that translates *Markdown* markup into regular XHTML tags for further XHTML and Schema validation. – coterobarros Jan 31 '19 at 13:20

You can do that with the non-greedy (reluctant) quantifier *?:

String patternString = "<code>.*?<\\/code>";

By default, * is greedy: it matches as much as it can, from the first occurrence of <code> to the last occurrence of </code>. With the question mark ? appended, it matches as little as possible and stops at the first </code> that follows.
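
To make the difference visible, here is a small sketch that runs both quantifiers over the same input. The class GreedyVsReluctant and the helper findAll are made-up names for this example, and Pattern.DOTALL is assumed so that the dot also matches line breaks, as in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
final class GreedyVsReluctant {

    // Collects every match of the regex in the source, with '.' matching line breaks.
    static List<String> findAll(String regex, String source) {
        Matcher matcher = Pattern.compile(regex, Pattern.DOTALL).matcher(source);
        List<String> hits = new ArrayList<>();
        while (matcher.find()) {
            hits.add(matcher.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String source = "<code>a</code> text <code>b\nc</code>";

        // Greedy: one match spanning from the first <code> to the last </code>.
        System.out.println(findAll("<code>.*</code>", source).size());  // prints 1

        // Reluctant: one match per block.
        System.out.println(findAll("<code>.*?</code>", source).size()); // prints 2
    }
}
```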

That said, I highly recommend not "parsing" any structure with regex; it is better to use a dedicated HTML parser.

Lino
  • Using regex to parse HTML is not really a good idea. There are so many edge cases, such as spaces or attributes inside the tag, nested tags, and tags with no closing tag. If you are certain that you will never have these cases you can get away with it, but just remember that sometimes when you use regex to solve a problem you can easily end up with 2 problems. – Damo Jan 31 '19 at 12:57
  • Yes, I agree with you Damo. This case of mine is an internal and well-controlled one, but parsing anonymous or external HTML with regex surely leads to the issues you mention. Thank you. – coterobarros Jan 31 '19 at 13:24