1

I have a large HTML `String` that starts with several lines of empty HTML markup which I do not need.

messageContent will contain something like:

        <td width="35"><br /> </td> 
        <td width="1"><br /> </td> 
        <td width="18"><br /> </td> 
        <td width="101"><br /> </td> 
        <td width="7"><br /> </td> 
        <td rowspan="21" colspan="16" width="689">Geachte&nbsp;heer/mevrouw,<br /> &nbsp;<br /> Wij&nbsp;hebben&nbsp;uw&nbsp;inzending&nbsp;ontvangen&nbsp;en&nbsp;gecontroleerd.&nbsp;Hierbij&nbsp;het&nbsp;verslag&nbsp;van&nbsp;de&nbsp;controle.<br /> &nbsp;<br />

I want to remove/replace everything before the line which contains 'Geachte', ' heer' and ' mevrouw'.

As output I would like to keep only:

        <td rowspan="21" colspan="16" width="689">Geachte&nbsp;heer/mevrouw,<br /> &nbsp;<br /> Wij&nbsp;hebben&nbsp;uw&nbsp;inzending&nbsp;ontvangen&nbsp;en&nbsp;gecontroleerd.&nbsp;Hierbij&nbsp;het&nbsp;verslag&nbsp;van&nbsp;de&nbsp;controle.<br /> &nbsp;<br />

I thought I would use a BufferedReader to loop through the text line by line:

BufferedReader reader = new BufferedReader(new StringReader(messageContent));
String string;

try {
    while ((string = reader.readLine()) != null) {
        if ((string.length() > 0) && (string.contains("Geachte"))) {
            // remove all lines before this string
        }
    }
} catch (IOException e) { }

How do I achieve this?

Nikolay Kuznetsov
Jef
  • Give us an example of input and output, please. – Nikolay Kuznetsov Jan 04 '13 at 10:02
  • and what do you mean by remove? just in string or in file? – Premraj Jan 04 '13 at 10:02
  • @Premraj just the `String` I am not getting the text from a file. – Jef Jan 04 '13 at 10:10
  • just use another String to keep matching lines – Nikolay Kuznetsov Jan 04 '13 at 10:13
  • @NikolayKuznetsov Thanks for your input! But how would it work if I want to add any other line after the matched line to the new `String`? – Jef Jan 04 '13 at 10:21
  • @NikolayKuznetsov I want to get rid of all the lines before the matched `String`. So `replace` them with `""` or just completely ignore them. – Jef Jan 04 '13 at 10:23
  • "I thought I would use a BufferedReader to loop through the text line by line:" What is the problem you are facing with it? – Nikolay Kuznetsov Jan 04 '13 at 10:23
  • How about lines which come after the matched line? – Nikolay Kuznetsov Jan 04 '13 at 10:24
  • @NikolayKuznetsov Sorry, I am trying to explain it as clearly as possible. Let's say I have a String: `content = "this text is completely unnecessary."+"I need to keep this line"+"I need to keep this line also"+"and this one also"+"etc";` I want to remove the unnecessary part from that `String`, so that in the end I would just have: `"I need to keep this line"+"I need to keep this line also"+"and this one also"+"etc";` – Jef Jan 04 '13 at 10:31

2 Answers

2

This code will do it.

// Needs: java.io.BufferedReader, java.io.IOException, java.io.StringReader
public String cutText(String messageContent) {
    boolean matchFound = false;
    StringBuilder output = new StringBuilder();
    // try-with-resources closes the reader when the loop is done
    try (BufferedReader reader = new BufferedReader(new StringReader(messageContent))) {
        String line;
        while ((line = reader.readLine()) != null) {
            // From the first line containing the marker onwards, keep everything
            if (line.contains("Geachte")) {
                matchFound = true;
            }
            if (matchFound) {
                output.append(line).append("\n");
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return output.toString();
}
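Since the text is already in memory as a `String`, the same cut can also be sketched without a reader, using `indexOf` and `lastIndexOf`. This is only an alternative illustration: the class name `CutBeforeMarker` and the method `cutBefore` are made up for this example, and returning the input unchanged when the marker is missing is an assumption (the loop above returns an empty string in that case).

```java
public class CutBeforeMarker {

    /**
     * Returns the text starting at the beginning of the first line that
     * contains the marker. If the marker is absent, the text is returned
     * unchanged (an assumption made for this sketch).
     */
    public static String cutBefore(String text, String marker) {
        int hit = text.indexOf(marker);
        if (hit < 0) {
            return text;
        }
        // lastIndexOf returns -1 when the marker is on the first line,
        // so adding 1 yields index 0 in that case.
        int lineStart = text.lastIndexOf('\n', hit) + 1;
        return text.substring(lineStart);
    }

    public static void main(String[] args) {
        String messageContent =
                "<td width=\"35\"><br /> </td>\n"
              + "<td width=\"1\"><br /> </td>\n"
              + "<td rowspan=\"21\">Geachte&nbsp;heer/mevrouw,<br />\n"
              + "&nbsp;<br />\n";
        // Prints the last two lines of messageContent
        System.out.print(cutBefore(messageContent, "Geachte"));
    }
}
```

Unlike the line-by-line loop, this keeps the original line terminators as they are and does not append a newline after the last line.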
ben75
  • I was also thinking about something like this but OP's explanations are so confusing about what he actually wants to achieve, so decided to refrain from answering. – Nikolay Kuznetsov Jan 04 '13 at 10:41
  • @NikolayKuznetsov Sorry, I tried to explain it as clear as possible, thanks for your help though. – Jef Jan 04 '13 at 10:49
  • @ben75 Thanks for your answer, this is what I was trying to achieve. – Jef Jan 04 '13 at 10:51
  • @Jef, Basically ben75 keeps all the lines which are after the matched line, that is what I was asking about. – Nikolay Kuznetsov Jan 04 '13 at 11:01
1

The easiest way is to use XPath. First you need to know the correct path to the `tr` you want to remove. You can find it with the Chrome Developer Tools (F12 on Linux/Windows, Cmd+Alt+I on Mac): open the Elements tab, pick the element you want (with the magnifying glass), right-click it and select Copy XPath.

Since your content is a String (not a file), you can just copy and paste it once (e.g. when debugging) into an HTML file and open it with Chrome. It is safer to give the parent of the faulty block a unique id, since the XPath will then be shorter and less likely to change.

This will give you something like:

//*[@id="answers-header"]/div/h2

First you need to convert your String to a Document:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(new InputSource(new StringReader("your string")));
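One caveat with this step: `DocumentBuilder` is an XML parser, so the input has to be well-formed XML. A run of sibling `<td>` elements has no single root element, and `&nbsp;` is an HTML entity that plain XML does not define, so the markup from the question would likely be rejected as it stands. A minimal sketch of one workaround, assuming the fragment is otherwise well-formed; the class name, the `parseFragment` helper and the `<root>` wrapper element are invented for this illustration:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class HtmlFragmentToDocument {

    /** Wraps the fragment in a hypothetical <root> element and parses it as XML. */
    public static Document parseFragment(String fragment) {
        // &#160; is the numeric reference for the same character as &nbsp;
        String xml = "<root>" + fragment.replace("&nbsp;", "&#160;") + "</root>";
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Fragment is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String fragment = "<td width=\"35\"><br /> </td>\n"
                        + "<td width=\"689\">Geachte&nbsp;heer/mevrouw,<br /></td>\n";
        Document doc = parseFragment(fragment);
        // Number of <td> elements found in the parsed fragment
        System.out.println(doc.getElementsByTagName("td").getLength());
    }
}
```

Real-world HTML e-mail is often not well-formed at all (unclosed tags, unquoted attributes), in which case an XML parser is not a good fit and a lenient HTML parser would be needed instead.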

Then you apply the xpath on your document:

XPathFactory xPathfactory = XPathFactory.newInstance();
XPath xpath = xPathfactory.newXPath();
XPathExpression expr = xpath.compile(<xpath_expression>);
NodeList nl = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);

And remove the invalid nodes:

for (int i = 0; i < nl.getLength(); i++) {
      Node node = nl.item(i);
      // Detach each matched node from its parent
      node.getParentNode().removeChild(node);
}

Then you need to transform the document back to a String.
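For that last step, an identity `Transformer` from `javax.xml.transform` can serialize the DOM back into a `String`. The sketch below strings all the steps together on a small well-formed sample. The class and method names, the sample markup and the XPath expression `//td[not(contains(., 'Geachte'))]` are assumptions chosen for illustration, not taken from the real e-mail.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RemoveNodesRoundTrip {

    /** Removes every node matched by the XPath expression and returns the remaining XML. */
    public static String removeMatching(String xml, String expression) {
        try {
            // 1. String -> Document
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // 2. Select the unwanted nodes
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList matches = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);

            // 3. Detach each match from its parent
            for (int i = 0; i < matches.getLength(); i++) {
                Node node = matches.item(i);
                node.getParentNode().removeChild(node);
            }

            // 4. Document -> String, without the <?xml ...?> declaration
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not process the XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<tr><td width=\"35\"><br/></td>"
                   + "<td width=\"689\">Geachte heer/mevrouw,</td></tr>";
        // Drops every <td> whose text does not contain "Geachte"
        System.out.println(removeMatching(xml, "//td[not(contains(., 'Geachte'))]"));
    }
}
```

The serializer writes XML rather than HTML, so details such as how empty elements are written may differ from the original markup.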

asgoth
  • Thanks for your answer, I am going to experiment with it a bit. A note though: I am not retrieving this HTML from a website; it comes from an HTML e-mail that is fetched with JavaMail. – Jef Jan 04 '13 at 10:37
  • Is the structure always the same? Just copy paste it once (e.g. when debugging) into an html file and open it with Chrome. It is safer if you give the parent of the faulty block an id, since the xpath will be shorter and less likely to change. – asgoth Jan 04 '13 at 10:40