Matching plain text to HTML content

Question

I need to match plain text against HTML content and, once a match is found, extract the matched HTML exactly as it appears (without changing it, since I need exactly the same HTML content). I am able to match in many scenarios using Java's regex utilities, but it fails in the scenarios below.

Below is the sample code I am using to match Text with HTML String

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static void main(String[] args) {

    String text = "A crusader for the rights of the weaker sections of the Association&#39;s (ADA&#39;s),choice as the presidential candidate is being seen as a political masterstroke.";
    String regex = "A crusader for the rights of the weaker sections of the Association's (ADA's) ".replaceAll(" ", ".*");

    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(text);
    // Check all occurrences
    while (matcher.find()) {

        System.out.print("Start index: " + matcher.start());
        System.out.print(" End index: " + matcher.end());
        System.out.println(" Found: " + matcher.group());

    }
}

The following edge cases fail:

Case 1:

Source Text = "A crusader for the rights of the weaker sections of the Association's (ADA's),choice as the presidential candidate is being seen as a political masterstroke.";

Text to match = "A crusader for the rights of the weaker sections of the Association's (ADA's)"

Expected output: “A crusader for the rights of the weaker sections of the Association's (ADA's)”

Case 2:

Source Text:

“<ul>
   <li>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</li>
   <li>Aliquam tincidunt mauris eu risus.</li>
   <li>Vestibulum auctor dapibus neque.</li>
see (<a href=\"https://www.webpagefx.com/web-design/html-ipsum/">HTML Content Sample </a>.)
</ul>”

Text to match: “see (HTML Content Sample.)”

Expected output: “see (<a href=\"https://www.webpagefx.com/web-design/html-ipsum/">HTML Content Sample </a>.)”

Case 3:

Source Text = "Initial history includes the following:</p>\n<p>Documentation of <li>Aliquam tincidunt mauris eu risus.</li>"

Text to match = "Initial history includes the following: Documentation of"

Expected output: “Initial history includes the following:</p>\n<p>Documentation of”

First, some characters are reserved in regex, for example the dot and the parentheses `()`. How will you deal with these? — Youcef LAIDANI, Jun 19 '17 at 15:37
I know this isn't very helpful, but I personally wouldn't recommend regex for HTML manipulation, for the reasons given in the linked question. It may also have some answers that help if you absolutely have to use regex: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Ryan - Llaver, Jun 19 '17 at 15:50
Hmm, this is not an easy problem, @pankajdesai, because there are many cases to handle: not just the parentheses but also the dot and other reserved characters in regex :) — Youcef LAIDANI, Jun 22 '17 at 14:24
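
To illustrate the point about reserved characters, here is a minimal sketch using `Pattern.quote` from the standard library. The class and method names are made up for this example, and the sample string uses a plain apostrophe (not `&#39;`) so that only the parentheses issue is shown.

```java
import java.util.regex.Pattern;

class LiteralMatchDemo {
    // True if needle occurs in haystack when needle is compiled as a regex.
    static boolean foundAsRegex(String needle, String haystack) {
        return Pattern.compile(needle).matcher(haystack).find();
    }

    // True if needle occurs in haystack when every character is taken literally.
    static boolean foundAsLiteral(String needle, String haystack) {
        return Pattern.compile(Pattern.quote(needle)).matcher(haystack).find();
    }

    public static void main(String[] args) {
        String html = "the Association's (ADA's),choice";
        // Unquoted, "(ADA's)" is a capturing group that matches "ADA's"
        // without the parentheses, so "(ADA's)," is not found.
        System.out.println(foundAsRegex("(ADA's),", html));   // false
        System.out.println(foundAsLiteral("(ADA's),", html)); // true
    }
}
```

Quoting alone does not handle entities such as `&#39;` or tags between words; it only removes the special meaning of characters like `(`, `)` and `.`.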

Leo Aso · Answer 1 · 2017-06-23T14:35:12.837

I recently came up with a regular expression to match HTML tags, with support for quoted attributes and escaped quotes within quoted attributes. It goes like this:
<([^'">]|"([^\\"]|\\"?)+"|'([^\\']|\\'?)+')+>
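
As a quick check of what this expression accepts, the sketch below writes it as a Java string literal (each regex backslash doubled, each double quote escaped) and lists the tags it finds. The class name, method name and sample input are assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagRegexDemo {
    // The tag regex from above as a Java string literal.
    static final Pattern TAG = Pattern.compile(
            "<([^'\">]|\"([^\\\\\"]|\\\\\"?)+\"|'([^\\\\']|\\\\'?)+')+>");

    // Returns every substring of html that the tag regex matches.
    static List<String> findTags(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // The '>' inside the quoted title does not end the tag early.
        System.out.println(findTags("see (<a title=\"a > b\" href='x'>HTML</a>.)"));
    }
}
```

Note that the quoted-attribute branches require at least one character between the quotes, so a tag with an empty attribute value such as `alt=""` is not matched by this expression as written.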

I think the easiest way to search for plain text in HTML while preserving the HTML is to modify the plain text so that it ignores tags at word boundaries, à la

import java.util.regex.Pattern;

// Usage: htmlSearch("ab cd").matcher("<b>ab</b> <i>cd</i>").matches();
public static Pattern htmlSearch(String plain) {
    // Escape regex metacharacters first, so later steps can add regex syntax
    plain = plain.replaceAll("[\\\\^$.|?*+()\\[\\]{}]", "\\\\$0");
    // Check for tags before and after every word, number and symbol
    // (an escaped metacharacter such as \( counts as one symbol)
    plain = plain.replaceAll("\\\\.|[A-Za-z]+|\\d+|[^\\w\\s]",
            "``TAGS``$0``TAGS``");
    // Handle & before any other entity alternatives are inserted
    plain = plain.replace("&", "(&|&amp;|&#38;)");
    // Check for tags wherever (one or more) spaces are found
    plain = plain.replaceAll("\\s+", "((\\\\s|&nbsp;)+|``TAGS``)*");
    // Handle the remaining special characters
    plain = plain
            .replace("<", "(<|&lt;|&#60;)")
            .replace(">", "(>|&gt;|&#62;)")
            .replace("'", "('|&apos;|&#39;)")
            .replace("\"", "(\"|&quot;|&#34;)");
    // Insert the ``TAGS`` pattern (zero or more tags)
    String tags = "(<([^'\">]"
                + "|\"([^\\\\\"]|\\\\\"?)+\""
                + "|'([^\\\\']|\\\\'?)+')+>)*";
    plain = plain.replace("``TAGS``", tags);

    return Pattern.compile(plain);
}
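
For comparison, here is a self-contained sketch of the same idea that builds the pattern token by token instead of through chained string replacements. This is a variant, not the code above: it uses `Pattern.quote` for literal tokens and non-capturing groups, and the names `HtmlSearchSketch`, `literal` and `extract` are hypothetical. The inputs in `main` are shortened fragments of Case 1 and Case 3 from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HtmlSearchSketch {
    // One HTML tag: the tag regex from the answer, with non-capturing groups.
    static final String TAG =
            "<(?:[^'\">]|\"(?:[^\\\\\"]|\\\\\"?)+\"|'(?:[^\\\\']|\\\\'?)+')+>";
    static final String TAGS = "(?:" + TAG + ")*";
    // Splits plain text into words, numbers, whitespace runs and single symbols.
    static final Pattern TOKEN =
            Pattern.compile("[A-Za-z]+|\\d+|\\s+|[^A-Za-z\\d\\s]");

    static Pattern htmlSearch(String plain) {
        StringBuilder regex = new StringBuilder();
        Matcher m = TOKEN.matcher(plain);
        while (m.find()) {
            String token = m.group();
            if (Character.isWhitespace(token.charAt(0))) {
                // Whitespace in the plain text may be whitespace, &nbsp; or tags in the HTML
                regex.append("(?:\\s|&nbsp;|").append(TAG).append(")*");
            } else {
                regex.append(TAGS).append(literal(token)).append(TAGS);
            }
        }
        return Pattern.compile(regex.toString());
    }

    // Regex for one token, accepting common entity spellings of special characters.
    static String literal(String token) {
        switch (token) {
            case "<":  return "(?:<|&lt;|&#60;)";
            case ">":  return "(?:>|&gt;|&#62;)";
            case "&":  return "(?:&|&amp;|&#38;)";
            case "'":  return "(?:'|&apos;|&#39;)";
            case "\"": return "(?:\"|&quot;|&#34;)";
            default:   return Pattern.quote(token); // all other tokens match literally
        }
    }

    // Returns the first matching HTML fragment unchanged, or null if there is none.
    static String extract(String plain, String html) {
        Matcher m = htmlSearch(plain).matcher(html);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(extract(
                "the Association's (ADA's)",
                "sections of the Association&#39;s (ADA&#39;s),choice as"));
        System.out.println(extract(
                "includes the following: Documentation of",
                "history includes the following:</p>\n<p>Documentation of <li>"));
    }
}
```

Using `find()` with `group()` returns the HTML exactly as it appears in the source. Case 2 from the question is not covered by this sketch: its HTML has a space before `</a>` that the plain text lacks, and whitespace is only allowed where the plain text itself contains whitespace.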
