Changing HTML to PlainText

Question

I'm trying to convert HTML to plain text in Java, but I'm running into an issue. I want to read the pixel value of the padding-left style on each paragraph and turn it into tabs, but it doesn't work. For example, <p style="padding-left:40px;">Hello</p> should become Hello with a tab in front of it.

Here is my code so far (every 40px becomes one tab):

private static String setNonHTML(String txt)
{
    System.out.println(txt.substring(txt.indexOf("<p style=\"padding-left:") + 23, txt.indexOf("px\"><b>")));
    //return "";
    return txt
    .replaceAll("<br>","\n")
    .replaceAll(txt.substring(txt.indexOf("<p style=\"padding-left:"), txt.indexOf("px\"><b>") + 7)
        ,"\n" + repeat("\t",Integer.parseInt(txt.substring(txt.indexOf("<p style=\"padding-left:") + 23, txt.indexOf("px\"><b>")))/40))
    .replaceAll(txt.substring(txt.indexOf("<p style=\"padding-left:"), txt.indexOf("px\">") + 4)
        ,"\n" + repeat("\t",Integer.parseInt(txt.substring(txt.indexOf("<p style=\"padding-left:") + 23, txt.indexOf("px\">")))/40))
    .replaceAll("(?s)<[^>]*>(\\s*<[^>]*>)*", "\n");
}

@Stewart class project and we can't use external libraries :/ — Theo, Nov 15 '17 at 23:57
Read the answer to this question to learn about parsing HTML with regex. Make sure to pass it on to your tutor ... https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Stewart, Nov 15 '17 at 23:58
(Use an XML parser or JSoup instead. That's how it's done in industry.) — Stewart, Nov 15 '17 at 23:59

Andrew · Accepted Answer · 2017-11-16T05:12:27.363

I cleaned up some of your code to show you what is happening:

    private static String setNonHTML(String txt)
    {
        System.out.println(txt.substring(txt.indexOf("<p style=\"padding-left:") + 23, txt.indexOf("px\"><b>")));
        //return "";

        //grab the padding text indexes
        int beforePaddingIndex = txt.indexOf("<p style=\"padding-left:");
        int afterPaddingIndex = txt.indexOf("px\"><b>");


        //replace all breaks with new lines
        txt = txt.replaceAll("<br>", "\n");

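        //replaces every occurrence of the first padded "<p ...px\"><b>" tag with \n plus one \t per 40px
        //note: replaceAll treats its first argument as a regex, and the tab count is taken from the first tag only
        txt = txt.replaceAll(txt.substring(beforePaddingIndex, afterPaddingIndex + 7), "\n" + repeat("\t", Integer.parseInt(txt.substring(beforePaddingIndex + 23, afterPaddingIndex)) / 40));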

        //the indexes have changed because the last operation modified txt, so look them up again.
        //indexOf returns -1 when no matching tag is left, and substring will then throw.
        beforePaddingIndex = txt.indexOf("<p style=\"padding-left:");
        int afterPaddingBeforeBoldIndex = txt.indexOf("px\">");

        //replace a substring of virtually the same tag a second time (now without <b>); if the first replace consumed the only padded tag, this finds nothing
        txt = txt.replaceAll(txt.substring(beforePaddingIndex, afterPaddingBeforeBoldIndex + 4), "\n" + repeat("\t", Integer.parseInt(txt.substring(beforePaddingIndex + 23, afterPaddingBeforeBoldIndex)) / 40));

        txt = txt.replaceAll("(?s)<[^>]*>(\\s*<[^>]*>)*", "\n");

        return txt;
    }

As you can see, after the first replaceAll there is a second replaceAll that operates on virtually the same indexes. Your code looks up the indexes inline after the first replaceAll, so I set them again to replicate that behavior. Splitting code into descriptively named variables and sections is good practice and helps enormously when debugging complicated expressions. I don't know what output your program gives you, so I can't tell whether this actually solves your issue, but it does look like a bug, and I believe this gives you a good start.

As for how to fix this, you may want to look into an off-the-shelf solution such as http://htmlcleaner.sourceforge.net/javause.php

It lets you traverse and modify HTML programmatically, read attributes such as padding-left, and extract the content between tags.
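If external libraries are off the table (the comments mention a class project), here is a minimal standard-library sketch of one way to give each padded paragraph its own tab count. It is not a general HTML parser: it assumes the style attribute contains only padding-left, as in the question's example. The class name PaddingToTabs and the repeat helper are my own stand-ins (the question's repeat is not shown), so treat them as assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PaddingToTabs {
    // Matches an opening <p> tag whose style is only padding-left:<n>px (optional ';'),
    // optionally followed by <b>, and captures the pixel count in group 1.
    private static final Pattern PADDED_P =
            Pattern.compile("<p style=\"padding-left:\\s*(\\d+)px;?\">(?:<b>)?");

    // Hypothetical stand-in for the question's repeat(); String.repeat needs Java 11+.
    static String repeat(String s, int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append(s);
        }
        return sb.toString();
    }

    static String setNonHTML(String txt) {
        Matcher m = PADDED_P.matcher(txt);
        // StringBuffer because the StringBuilder overload of appendReplacement needs Java 9+.
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Each match computes its own tab count: every 40px becomes one tab.
            int tabs = Integer.parseInt(m.group(1)) / 40;
            m.appendReplacement(out, Matcher.quoteReplacement("\n" + repeat("\t", tabs)));
        }
        m.appendTail(out);
        return out.toString()
                .replace("<br>", "\n") // literal replacement, no regex involved
                .replaceAll("(?s)<[^>]*>(\\s*<[^>]*>)*", "\n"); // remaining tags, as in the question
    }

    public static void main(String[] args) {
        String html = "<p style=\"padding-left:40px;\">Hello</p>"
                + "<p style=\"padding-left:80px;\">World</p>";
        System.out.print(setNonHTML(html));
    }
}
```

The difference from the original is that the tab count is computed inside the loop for every match, instead of once from the first tag. Note also that String.replace takes a literal while String.replaceAll takes a regex, so passing a substring of the input to replaceAll can misbehave if that substring contains regex metacharacters.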

Thanks! Realized that the replaceAll function replaces all the substrings with the padding of the first one so I just made a while loop until beforePaddingIndex = -1 — Theo, Nov 16 '17 at 01:28
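For readers who want to see the loop described in the comment above, it could look roughly like this. This is only a sketch of the idea, not the asker's actual code: the names PaddingLoop and replacePaddedParagraphs and the repeat helper are made up for illustration, and it assumes the value between padding-left: and px is a plain integer.

```java
final class PaddingLoop {
    private static final String OPEN = "<p style=\"padding-left:";

    // Hypothetical stand-in for the question's repeat() helper.
    static String repeat(String s, int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append(s);
        }
        return sb.toString();
    }

    // Replaces one padded <p ...> opening tag per iteration until none is left.
    static String replacePaddedParagraphs(String txt) {
        int start = txt.indexOf(OPEN);
        while (start != -1) {
            int pxEnd = txt.indexOf("px", start + OPEN.length());
            int tagEnd = txt.indexOf(">", pxEnd);
            if (pxEnd == -1 || tagEnd == -1) {
                break; // malformed tag: stop rather than loop forever
            }
            // Parse this tag's own pixel value (throws NumberFormatException if not an integer).
            int px = Integer.parseInt(txt.substring(start + OPEN.length(), pxEnd).trim());
            // Splice in the replacement for this one tag only.
            txt = txt.substring(0, start)
                    + "\n" + repeat("\t", px / 40)
                    + txt.substring(tagEnd + 1);
            // Indexes are stale after the splice, so search again from scratch.
            start = txt.indexOf(OPEN);
        }
        return txt;
    }
}
```

Because each pass removes exactly one opening tag and then searches again, every paragraph gets the tab count from its own padding value, and the loop ends once indexOf returns -1.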
