
I need to match plain text against HTML content and, once a match is found, extract the matched HTML exactly as it appears in the source (the HTML must not be altered). I can match many cases using Java's regex utilities, but it fails in the scenarios below.

Below is the sample code I am using to match the text against an HTML string:

public static void main(String[] args) {

    String text = "A crusader for the rights of the weaker sections of the Association's (ADA's),choice as the presidential candidate is being seen as a political masterstroke.";
    String regex = "A crusader for the rights of the weaker sections of the Association's (ADA's) ".replaceAll(" ", ".*");

    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(text);
    // Check all occurrences
    while (matcher.find()) {

        System.out.print("Start index: " + matcher.start());
        System.out.print(" End index: " + matcher.end());
        System.out.println(" Found: " + matcher.group());

    }
}

The following edge cases fail:

Case 1:

Source Text = "A crusader for the rights of the weaker sections of the Association's (ADA's),choice as the presidential candidate is being seen as a political masterstroke."

Text to match = "A crusader for the rights of the weaker sections of the Association's (ADA's)"

Expected output: “A crusader for the rights of the weaker sections of the Association's (ADA's)”

Case 2:

Source Text:

“<ul>
   <li>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</li>
   <li>Aliquam tincidunt mauris eu risus.</li>
   <li>Vestibulum auctor dapibus neque.</li>
see (<a href=\"https://www.webpagefx.com/web-design/html-ipsum/">HTML Content Sample </a>.)
</ul>”

Text to match: “see (HTML Content Sample.)”

Expected output: “see (<a href=\"https://www.webpagefx.com/web-design/html-ipsum/">HTML Content Sample </a>.)”

Case 3:

Source Text = "Initial history includes the following:</p>\n<p>Documentation of <li>Aliquam tincidunt mauris eu risus.</li>"

Text to match = "Initial history includes the following: Documentation of"

Expected output: "Initial history includes the following:</p>\n<p>Documentation of"

Youcef LAIDANI
pankaj desai
  • First, some characters are reserved in regex, for example the dot and the parentheses `()`. How do you deal with those? – Youcef LAIDANI Jun 19 '17 at 15:37
  • I know this isn't very helpful, but I personally wouldn't recommend regex for HTML manipulation, for the reasons given in the linked question. There may also be some answers there that help if you absolutely have to use regex. https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Ryan - Llaver Jun 19 '17 at 15:50
  • @YCF_L I am replacing the parentheses () with a space – pankaj desai Jun 19 '17 at 15:52
  • @YCF_L Any idea about the above problem statement? – pankaj desai Jun 22 '17 at 14:18
  • Hmm, this is not an easy problem @pankajdesai, because there are many cases to handle: not just the parentheses but also the dot and the other reserved characters in regex :) – Youcef LAIDANI Jun 22 '17 at 14:24
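
To illustrate the reserved-character point from the comments: in the question's pattern, `(ADA's)` is read as a capturing group rather than literal parentheses, and every space becomes a greedy `.*`. The following is a minimal sketch of one way around that for the plain-text Case 1 only, assuming it is acceptable to quote each word with `Pattern.quote` and allow a run of whitespace between words. The class and method names (`QuotedMatchDemo`, `literalWords`, `firstMatch`) are made up for this example, and it does not deal with HTML tags at all.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedMatchDemo {
    // Hypothetical helper: quote every whitespace-separated word so that
    // characters such as ( ) . are matched literally, and join the words
    // with \s+ so any run of whitespace is accepted between them.
    static Pattern literalWords(String text) {
        StringBuilder regex = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (regex.length() > 0) {
                regex.append("\\s+");
            }
            regex.append(Pattern.quote(word));
        }
        return Pattern.compile(regex.toString());
    }

    // Returns the first matching substring of haystack, or null if none.
    static String firstMatch(String needle, String haystack) {
        Matcher m = literalWords(needle).matcher(haystack);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String source = "A crusader for the rights of the weaker sections of the "
                + "Association's (ADA's),choice as the presidential candidate is "
                + "being seen as a political masterstroke.";
        String wanted = "A crusader for the rights of the weaker sections of the "
                + "Association's (ADA's)";
        // Prints the wanted text, ending at "(ADA's)" and not running on to the comma
        System.out.println(firstMatch(wanted, source));
    }
}
```

Because no `.*` is involved, the match stops at the closing parenthesis instead of swallowing the rest of the sentence.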

1 Answer


I recently came up with a regular expression to match HTML tags, with support for quoted attributes and escaped quotes within quoted attributes. It goes like
<([^'">]|"([^\\"]|\\"?)+"|'([^\\']|\\'?)+')+>
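
As a quick sanity check, the expression above can be written as a Java string literal and tried against a few tags. This is only a sketch: the class name `TagRegexDemo` and the helper `isTag` are invented for the example, and the sample tags are arbitrary.

```java
import java.util.regex.Pattern;

public class TagRegexDemo {
    // The tag expression from the text, with Java string escaping applied:
    // every regex backslash is doubled and every double quote is escaped.
    static final Pattern TAG = Pattern.compile(
            "<([^'\">]|\"([^\\\\\"]|\\\\\"?)+\"|'([^\\\\']|\\\\'?)+')+>");

    // True if the whole string is a single tag according to the expression.
    static boolean isTag(String s) {
        return TAG.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isTag("<p>"));                               // true
        // A '>' inside a quoted attribute does not end the tag
        System.out.println(isTag("<a href=\"x.html\" title='a > b'>")); // true
        // Escaped quotes inside a quoted attribute are accepted
        System.out.println(isTag("<a title=\"say \\\"hi\\\"\">"));      // true
        System.out.println(isTag("plain text"));                        // false
    }
}
```

The quoted-attribute alternatives are what let a `>` or an escaped quote appear inside an attribute value without ending the tag early.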

I think the easiest way to search for plain text in HTML while preserving the HTML is to modify the plain text so that it ignores tags at word boundaries, à la

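import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Usage: htmlSearch("ab cd").matcher("<b>ab</b> <i>cd</i>").matches();
public static Pattern htmlSearch(String plain) {
    // A single HTML tag, allowing quoted attributes with escaped quotes
    String tag = "<(?:[^'\">]|\"(?:[^\\\\\"]|\\\\\"?)+\"|'(?:[^\\\\']|\\\\'?)+')+>";
    String tags = "(?:" + tag + ")*";
    // Wherever (one or more) spaces are found, accept spaces, &nbsp; or tags
    String gap = "(?:\\s|&nbsp;|" + tag + ")*";
    StringBuilder regex = new StringBuilder();
    // Split the plain text into whitespace runs, words, numbers and symbols.
    // Building the pattern token by token keeps the inserted regex syntax
    // from being escaped or entity-expanded by a later step.
    Matcher token = Pattern.compile("\\s+|[A-Za-z]+|\\d+|[^A-Za-z\\d\\s]").matcher(plain);
    while (token.find()) {
        String t = token.group();
        if (t.trim().isEmpty()) {
            regex.append(gap);
            continue;
        }
        // Handle special characters: accept the character or its entities
        String literal;
        switch (t) {
            case "<":  literal = "(?:<|&lt;|&#60;)"; break;
            case ">":  literal = "(?:>|&gt;|&#62;)"; break;
            case "&":  literal = "(?:&|&amp;|&#38;)"; break;
            case "'":  literal = "(?:'|&apos;|&#39;)"; break;
            case "\"": literal = "(?:\"|&quot;|&#34;)"; break;
            default:   literal = Pattern.quote(t); // escapes ( ) . * + and so on
        }
        // Check for tags before and after every word, number and symbol
        regex.append(tags).append(literal).append(tags);
    }
    return Pattern.compile(regex.toString());
}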
Leo Aso