I have to write some Java code that highlights text in an HTML file displayed in a JTextPane.

For highlighting, I replace "match" with "<span style=\"background-color: #FFFF00\">match</span>" and set the whole replaced text in the JTextPane, using java.util.regex.Pattern and java.util.regex.Matcher. So far this works fine.

Now I have found a problem: the matcher also matches text inside HTML tags. For example, take this line:

<pre><a name="hello-world">Hello World</a></pre>
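
The problem can be reproduced with a minimal sketch along these lines. The class name `NaiveHighlight` and the method `highlight` are hypothetical and only illustrate the naive replacement described above:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class NaiveHighlight {

    // Wraps every occurrence of the literal search term in a highlighting span,
    // without checking whether the occurrence lies inside a tag.
    static String highlight(String html, String term) {
        String replacement = "<span style=\"background-color: #FFFF00\">" + term + "</span>";
        Matcher matcher = Pattern.compile(Pattern.quote(term)).matcher(html);
        return matcher.replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String line = "<pre><a name=\"hello-world\">Hello World</a></pre>";
        // The "e" in the tag names and in the attribute value is wrapped too,
        // which breaks the markup.
        System.out.println(highlight(line, "e"));
    }
}
```

Here the "e" in `pre`, `name` and `hello-world` is replaced as well, not only the one in "Hello".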

I need a regex for a java.util.regex.Pattern that only searches in the text "Hello World".

So, if I want to highlight the matches of "e", the result should look like this:

<pre><a name="hello-world">H<span style="background-color: #FFFF00">e</span>llo World</a></pre>

Thank you very much for your help!!

mrbela
  • http://stackoverflow.com/a/1732454/2235972 – denvercoder9 Dec 12 '16 at 14:13
  • Don't use a regular expression to replace HTML tags. Crawl the DOM and find what you need to replace. – Mr. Polywhirl Dec 12 '16 at 14:16
  • See http://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-reg . In short, parsing HTML with regex is bad; don’t do it. – VGR Dec 12 '16 at 14:16
  • I don't even think it's possible, since regular expressions cannot express context-sensitive languages, and HTML is context-sensitive. – Thomas Philipp Dec 12 '16 at 14:18

2 Answers

I would do something like:

Pattern pattern = Pattern.compile("^>(.*)$<");
Matcher matcher = pattern.matcher(matchedTextBuilder.toString());
while (matcher.find()) {
    String matchedFoundText = matcher.group();
}

A better approach:

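import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HighlightBetweenTags {

    public static void main(String[] args) {
        String originalString = "dfedf >Hello< href= ui /> Hello< another";
        StringBuilder sb = new StringBuilder();
        // Match the text between a '>' and the next '<', i.e. text outside of a tag.
        // [^<>]+ also covers text with several words, such as "Hello World".
        Pattern pattern = Pattern.compile(">[^<>]+<");
        Matcher matcher = pattern.matcher(originalString);
        int endIndex = 0;
        while (matcher.find()) {
            String matchedFoundText = matcher.group();
            // Copy everything up to and including the '>' unchanged.
            sb.append(originalString, endIndex, matcher.start() + 1);
            // Highlight only within the text between the tags.
            sb.append(matchedFoundText.substring(1, matchedFoundText.length() - 1).replace("e",
                    "<span style=\"background-color: #FFFF00\">e</span>"));
            sb.append("<");
            endIndex = matcher.end();
        }
        // endIndex already points past the last '<', so no character is skipped here.
        sb.append(originalString.substring(endIndex));
        System.out.println(sb);
    }
}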
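
A different way to sketch this is to let the regex consume whole tags as one alternative and the search term as the other, and to wrap only the matches of the second alternative. The class name `TagAwareHighlighter` and the method `highlight` are hypothetical. The sketch assumes a non-empty search term and that no `>` appears inside attribute values, comments or scripts; for arbitrary HTML a parser remains the safer choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class TagAwareHighlighter {

    // Assumption: tags look like <...> and contain no '>' inside attribute values.
    static String highlight(String html, String term) {
        // First alternative: a whole tag, which is copied unchanged.
        // Second alternative (group 1): the literal search term outside of a tag.
        Pattern pattern = Pattern.compile("<[^>]*>|(" + Pattern.quote(term) + ")");
        Matcher matcher = pattern.matcher(html);
        StringBuilder result = new StringBuilder();
        int last = 0;
        while (matcher.find()) {
            // Copy the text between the previous match and this one.
            result.append(html, last, matcher.start());
            if (matcher.group(1) != null) {
                // The search term was found in text content: wrap it.
                result.append("<span style=\"background-color: #FFFF00\">")
                      .append(matcher.group(1))
                      .append("</span>");
            } else {
                // A tag was matched: keep it as it is.
                result.append(matcher.group());
            }
            last = matcher.end();
        }
        // Copy whatever follows the last match.
        result.append(html, last, html.length());
        return result.toString();
    }

    public static void main(String[] args) {
        String line = "<pre><a name=\"hello-world\">Hello World</a></pre>";
        System.out.println(highlight(line, "e"));
    }
}
```

For the line from the question and the term "e", only the "e" in "Hello" is wrapped; the tag names and the attribute value stay untouched.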
kimy82
  • This doesn't work, and even if you fix the incorrect usage of `^` and `$`, it won't work consistently, as it fails when the tag itself contains `<` or `>` (in quotes). – Tom Dec 12 '16 at 15:00

Try Jsoup, an HTML parser that can scrape and parse HTML from a URL, file, or string, and can also manipulate HTML elements, attributes, and text. An example for your case:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class NewClass2 {

    public static void main(String args[]) {
        String html = " <!DOCTYPE html>\n" +
                        "<html>\n" +
                            "<head>\n" +
                                "<title>Page Title</title>\n" +
                            "</head>\n" +
                            "<body>\n" +
                                "<h1>This is a Heading which should match</h1>\n" +
                                "<p>This is a paragraph which should also match.</p>\n" +
                            "</body>\n" +
                        "</html> ";

        String matchWord = "match";
        Document doc = Jsoup.parse(html);
        System.out.println("before : \n");
        System.out.println(doc.toString()+"\n");

        Elements matchingElements = doc.getElementsContainingOwnText(matchWord);
        for (Element e : matchingElements) {
            e.html(e.html().replace(matchWord,"<span style=\"background-color: #FFFF00\">"+matchWord+"</span>"));
        }
        System.out.println("after : \n");
        System.out.println(doc.toString());
   }
}
Eritrean