I need to match plain text against HTML content and, once a match is found, extract the matched HTML exactly as it appears in the source (the HTML must not be altered). Java's regex utilities work for many inputs, but matching fails in the scenarios below.
Below is the sample code I am using to match the text against the HTML string:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TextToHtmlMatch {
    public static void main(String[] args) {
        String text = "A crusader for the rights of the weaker sections of the Association's (ADA's),choice as the presidential candidate is being seen as a political masterstroke.";
        // Build the regex from the text to match by replacing every space with ".*"
        String regex = "A crusader for the rights of the weaker sections of the Association's (ADA's) ".replaceAll(" ", ".*");
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(text);
        // Check all occurrences
        while (matcher.find()) {
            System.out.print("Start index: " + matcher.start());
            System.out.print(" End index: " + matcher.end());
            System.out.println(" Found: " + matcher.group());
        }
    }
}
The following edge cases fail:
Case 1:
Source text: "A crusader for the rights of the weaker sections of the Association's (ADA's),choice as the presidential candidate is being seen as a political masterstroke."
Text to match: "A crusader for the rights of the weaker sections of the Association's (ADA's)"
Expected output: "A crusader for the rights of the weaker sections of the Association's (ADA's)"
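For Case 1, a likely cause is that `(` and `)` are regex metacharacters, so `(ADA's)` is compiled as a capturing group instead of literal parentheses, and the trailing `.*` produced by the trailing space in the text to match greedily consumes the rest of the source. A minimal sketch, assuming whitespace is the only difference between the two strings: quote each word with `Pattern.quote` and join the words with `\s+`. The class and method names (`QuotedWordMatch`, `firstMatch`) are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedWordMatch {
    // Hypothetical helper: returns the first literal match of textToMatch in source,
    // tolerating any run of whitespace between words; null when there is no match.
    static String firstMatch(String source, String textToMatch) {
        StringBuilder regex = new StringBuilder();
        for (String word : textToMatch.trim().split("\\s+")) {
            if (regex.length() > 0) {
                regex.append("\\s+"); // words may be separated by any whitespace
            }
            regex.append(Pattern.quote(word)); // treat every word as a literal
        }
        Matcher matcher = Pattern.compile(regex.toString()).matcher(source);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String source = "A crusader for the rights of the weaker sections of the Association's (ADA's),choice as the presidential candidate is being seen as a political masterstroke.";
        String textToMatch = "A crusader for the rights of the weaker sections of the Association's (ADA's)";
        System.out.println(firstMatch(source, textToMatch));
    }
}
```

This sketch does not skip HTML tags, so it only covers sources like the one in Case 1.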
Case 2:
Source text:
"<ul>
<li>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</li>
<li>Aliquam tincidunt mauris eu risus.</li>
<li>Vestibulum auctor dapibus neque.</li>
see (<a href="https://www.webpagefx.com/web-design/html-ipsum/">HTML Content Sample </a>.)
</ul>"
Text to match: "see (HTML Content Sample.)"
Expected output: "see (<a href="https://www.webpagefx.com/web-design/html-ipsum/">HTML Content Sample </a>.)"
Case 3:
Source text: "Initial history includes the following:</p>\n<p>Documentation of <li>Aliquam tincidunt mauris eu risus.</li>"
Text to match: "Initial history includes the following: Documentation of"
Expected output: "Initial history includes the following:</p>\n<p>Documentation of"
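One possible direction for the HTML cases, offered as a sketch and not a definitive solution: quote every character of the text to match and allow HTML tags or whitespace between characters, requiring at least one such gap wherever the plain text has a space. The names `HtmlTextMatcher`, `toPattern`, `extract`, and the `GAP` rule are hypothetical, and the sketch assumes tags contain no `>` inside attribute values and that the HTML has no entities such as `&amp;` in the matched region.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlTextMatcher {
    // Assumed gap rule: a single whitespace character or a single HTML tag.
    private static final String GAP = "(?:\\s|<[^>]*>)";

    // Hypothetical helper: turns plain text into a pattern that tolerates tags.
    static Pattern toPattern(String plainText) {
        StringBuilder regex = new StringBuilder();
        String[] words = plainText.trim().split("\\s+");
        for (int w = 0; w < words.length; w++) {
            if (w > 0) {
                regex.append(GAP).append('+'); // between words: at least one gap
            }
            int[] codePoints = words[w].codePoints().toArray();
            for (int i = 0; i < codePoints.length; i++) {
                if (i > 0) {
                    regex.append(GAP).append('*'); // inside a word: gap is optional
                }
                String ch = new String(Character.toChars(codePoints[i]));
                regex.append(Pattern.quote(ch)); // match the character literally
            }
        }
        return Pattern.compile(regex.toString());
    }

    // Returns the matched slice of the original HTML unchanged, or null if no match.
    static String extract(String html, String plainText) {
        Matcher matcher = toPattern(plainText).matcher(html);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String html = "Initial history includes the following:</p>\n<p>Documentation of "
                + "<li>Aliquam tincidunt mauris eu risus.</li>";
        System.out.println(extract(html, "Initial history includes the following: Documentation of"));
    }
}
```

Because `matcher.group()` returns a slice of the original input, the extracted HTML is not modified. The trade-off is that optional gaps inside words make the match more permissive (a tag in the middle of a word is accepted), and for arbitrary real-world HTML a parser-based approach may be more robust than a regex.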