Regex to remove email address from html

Question

I have HTML input returned from a method, and I need to remove the email address from it. The problem is that the email address is not inside a single div; it is split across multiple divs. Sample input below:

div  class="p" id="p9" style="top:89.17999pt;left:430.7740pt;font-family:Times New Roman;font-size:1.0pt;">hello</div>
div class="p" id="p10" style="top:89.17999pt;left:484.100pt;font-family:Times New Roman;font-size:1.0pt;">.</div>
div class="p" id="p11" style="top:89.17999pt;left:487.100pt;font-family:Times New Roman;font-size:1.0pt;">p</div>
<div class="p" id="p1" style="top:89.17999pt;left:493.9300pt;font-family:Times New Roman;font-size:1.0pt;">@</div>
div class="p" id="p13" style="top:89.17999pt;left:0.09003pt;font-family:Times New Roman;font-size:1.0pt;">gmail</div>
div class="p" id="p" style="top:89.17999pt;left:33.18pt;font-family:Times New Roman;font-size:1.0pt;">.</div>
<div class="r" style="left:79.84pt;bottom:9.pt;width:479.98004pt;height:1.71997pt;background-color:#d9d9d9;">&nbsp;</div>
div class="p" id="p1" style="top:89.17999pt;left:3.18pt;font-family:Times New Roman;font-size:1.0pt;">com</div>"

The regex we are using is [A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,6}, which only matches an email address in its standard, contiguous form. Any help will be really appreciated.
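To illustrate why the regex finds nothing here, the following minimal sketch runs it against a contiguous address and against a simplified version of the split markup. The class name `EmailRegexDemo`, the helper `containsEmail`, and the `CASE_INSENSITIVE` flag are assumptions for illustration; the question does not say which flags are used, and without that flag `[A-Z]` would not match lowercase addresses at all.

```java
import java.util.regex.Pattern;

class EmailRegexDemo {
    // Pattern from the question. CASE_INSENSITIVE is an assumption, not stated in the question.
    static final Pattern EMAIL = Pattern.compile(
            "[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,6}", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: true if the regex finds an address anywhere in s.
    static boolean containsEmail(String s) {
        return EMAIL.matcher(s).find();
    }

    public static void main(String[] args) {
        String plain = "contact: hello.p@gmail.com";
        // Simplified markup: the '@' is surrounded by tags, not by address characters.
        String split = "<div class=\"p\">hello</div><div class=\"p\">.</div>"
                + "<div class=\"p\">p</div><div class=\"p\">@</div>"
                + "<div class=\"p\">gmail</div><div class=\"p\">.</div>"
                + "<div class=\"p\">com</div>";

        System.out.println(containsEmail(plain));
        System.out.println(containsEmail(split));
    }
}
```

In the split markup the character before the `@` is `>` and the character after it is `<`, neither of which is in the regex's character classes, so no match is possible until the text fragments are joined.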

Edit: removed the div start tag as it was parsed to text by the page.

Is that the whole text that would need to be regex'd? Or are there more lines before or after this? — Jose Martinez, Oct 03 '15 at 04:48
@JoseMartinez I am sorry, I think I did not understand your question properly. There are more lines before and after this. I included only the email portion of the HTML. I hope that makes it clear. — t10011, Oct 03 '15 at 04:54
Don't use regex to parse HTML http://stackoverflow.com/q/373833/5290909 — Mariano, Oct 03 '15 at 04:55
If I can't see the whole text that I am regexing then I can't do it properly. — Jose Martinez, Oct 03 '15 at 04:58
@JoseMartinez The other parts of the HTML are not relevant, since they can change regularly. The HTML is not meant to be the same all the time. It will vary depending on the page being parsed. — t10011, Oct 03 '15 at 05:11
@Mariano We are not using regex to parse the HTML. Sorry if my question misled you. We are using regex only to remove the email address from the HTML. — t10011, Oct 03 '15 at 05:12
Ok. Well, this is what I was doing; let me know if it's worth continuing. I broke the HTML into separate lines. I was going through each line and going to yank out the text in between the ">" and " — Jose Martinez, Oct 03 '15 at 05:14
Please show the relevant part of your parser. You should check the parent's innerText and remove all children. — Mariano, Oct 03 '15 at 05:17
@JoseMartinez The approach we are trying to follow is to find the '@' symbol in the HTML and then find the divs preceding it. This way we will get the first part of the email address. We found two issues with it. First, we need to store the divs in a string array until we hit an '@' symbol in order to remove the preceding divs, which can add latency since we are doing it on the fly, and we need to merge the rest of the content with what is in the string array. Second, we are not sure how many preceding divs we need to read for the first part of the email. We are looking for a solution to the first problem. — t10011, Oct 03 '15 at 05:25
@Mariano We are not using just one parser. We are converting files from doc, docx, html, pdf, plain text, xml, etc. All of these are converted to HTML with a predefined format, so we cannot have a common solution for all of them. Multiple files have issues, so the only common point is the HTML. — t10011, Oct 03 '15 at 05:27
Still the same applies. It's HTML... what tool are you using to parse it? — Mariano, Oct 03 '15 at 05:31

Jose Martinez · Answer 1 · 2015-10-03T05:35:18.117

This worked for me.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        // Sample input from the question, one div per line.
        String text = "div  class=\"p\" id=\"p9\" style=\"top:89.17999pt;left:430.7740pt;font-family:Times New Roman;font-size:1.0pt;\">hello</div>\n"
                + "div class=\"p\" id=\"p10\" style=\"top:89.17999pt;left:484.100pt;font-family:Times New Roman;font-size:1.0pt;\">.</div>\n"
                + "div class=\"p\" id=\"p11\" style=\"top:89.17999pt;left:487.100pt;font-family:Times New Roman;font-size:1.0pt;\">p</div>\n"
                + "<div class=\"p\" id=\"p1\" style=\"top:89.17999pt;left:493.9300pt;font-family:Times New Roman;font-size:1.0pt;\">@</div>\n"
                + "div class=\"p\" id=\"p13\" style=\"top:89.17999pt;left:0.09003pt;font-family:Times New Roman;font-size:1.0pt;\">gmail</div>\n"
                + "div class=\"p\" id=\"p\" style=\"top:89.17999pt;left:33.18pt;font-family:Times New Roman;font-size:1.0pt;\">.</div>\n"
                + "<div class=\"r\" style=\"left:79.84pt;bottom:9.pt;width:479.98004pt;height:1.71997pt;background-color:#d9d9d9;\">&nbsp;</div>\n"
                + "div class=\"p\" id=\"p1\" style=\"top:89.17999pt;left:3.18pt;font-family:Times New Roman;font-size:1.0pt;\">com</div>\"";

        StringBuilder sb = new StringBuilder();
        String[] tokens = text.split("\n");

        // Captures the text between the opening tag's '>' and the closing </div on a line.
        Pattern p = Pattern.compile(".*>(.*)</div.*");

        for (String line : tokens) {
            Matcher m = p.matcher(line);
            if (m.matches()) {
                sb.append(m.group(1));
            }
        }

        // Prints the concatenated div contents; the &nbsp; from the class="r" div is included too.
        System.out.println(sb.toString());
    }
}

EDIT: If the input contains more divs, you may need to adjust the Pattern so that it matches only the divs that make up the email.
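One way such an adjustment could look is sketched below. It assumes that text fragments always sit in divs with class="p" and that other classes, like the class="r" rule in the sample, are layout only. That assumption comes from the sample alone; it does not single out email fragments from other text. The names `TextDivExtractor` and `extractText` are made up for illustration, and the sketch expects well-formed `<div` start tags.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TextDivExtractor {
    // Assumption: only <div class="p" ...> lines carry text; anything else is skipped.
    private static final Pattern TEXT_LINE =
            Pattern.compile(".*<div\\s+class=\"p\"[^>]*>(.*)</div.*");

    // Joins the contents of all class="p" divs, one div per line, as in the answer above.
    static String extractText(String html) {
        StringBuilder sb = new StringBuilder();
        for (String line : html.split("\n")) {
            Matcher m = TEXT_LINE.matcher(line);
            if (m.matches()) {
                sb.append(m.group(1));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<div class=\"p\" id=\"p11\">p</div>\n"
                + "<div class=\"p\" id=\"p12\">@</div>\n"
                + "<div class=\"p\" id=\"p13\">gmail</div>\n"
                + "<div class=\"p\" id=\"p14\">.</div>\n"
                + "<div class=\"r\" style=\"height:1pt;\">&nbsp;</div>\n"
                + "<div  class=\"p\" id=\"p15\">com</div>";
        System.out.println(extractText(html));
    }
}
```

With this filter the `&nbsp;` from the class="r" div no longer ends up in the middle of the joined text, so the email regex from the question has a chance to match it.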

Is there any text in these divs that makes them unique to the email address? If there is, we can include it in the Pattern regex. — Jose Martinez, Oct 03 '15 at 05:33
If there's more than one `<div>` in the same line, the code breaks. — Mariano, Oct 03 '15 at 05:39
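A rough sketch of a variant that does not depend on line breaks follows. It scans the whole input with find() instead of matching line by line, joins the text of the class="p" divs, runs the email regex on the joined text, and then cuts every div that contributed to a match. It is regex-based like the answer, so it assumes flat, well-formed class="p" divs with no nested tags; an HTML parser would be more robust. The names `SplitEmailRemover` and `removeEmails` and the `CASE_INSENSITIVE` flag are assumptions. Because fragments are joined without separators, text directly before or after the address with no whitespace in between would be swallowed into the match, which is the "how many preceding divs" problem raised in the comments and is not solved here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitEmailRemover {
    // Assumption: text fragments live in flat <div class="p" ...> elements.
    private static final Pattern TEXT_DIV =
            Pattern.compile("<div\\s+class=\"p\"[^>]*>([^<]*)</div>\\s*");
    private static final Pattern EMAIL = Pattern.compile(
            "[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,6}", Pattern.CASE_INSENSITIVE);

    static String removeEmails(String html) {
        // Each entry: {htmlStart, htmlEnd, textStart, textEnd} for one class="p" div.
        List<int[]> spans = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        Matcher div = TEXT_DIV.matcher(html);
        while (div.find()) {
            int textStart = text.length();
            text.append(div.group(1));
            spans.add(new int[] {div.start(), div.end(), textStart, text.length()});
        }

        // Mark every div whose text overlaps an email match in the joined text.
        boolean[] drop = new boolean[spans.size()];
        Matcher email = EMAIL.matcher(text);
        while (email.find()) {
            for (int i = 0; i < spans.size(); i++) {
                int[] s = spans.get(i);
                if (s[2] < email.end() && s[3] > email.start()) {
                    drop[i] = true;
                }
            }
        }

        // Copy the input, skipping the marked divs.
        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (int i = 0; i < spans.size(); i++) {
            if (!drop[i]) {
                continue;
            }
            int[] s = spans.get(i);
            out.append(html, pos, s[0]);
            pos = s[1];
        }
        out.append(html, pos, html.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<div class=\"p\" id=\"p8\">Contact: </div><div class=\"p\" id=\"p9\">hello</div>"
                + "<div class=\"p\" id=\"p10\">.</div>\n<div class=\"p\" id=\"p11\">p</div>"
                + "<div class=\"p\" id=\"p12\">@</div><div class=\"p\" id=\"p13\">gmail</div>"
                + "<div class=\"p\" id=\"p14\">.</div><div class=\"r\" style=\"x\">&nbsp;</div>"
                + "<div class=\"p\" id=\"p15\">com</div>";
        System.out.println(removeEmails(html));
    }
}
```

Note that a div is removed as a whole, so a fragment that mixes ordinary text with part of an address would lose that ordinary text as well.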
