I'm trying to convert HTML to plain text in Java, but I'm running into an issue. I want to read the number from the padding-left style in the markup and turn it into tabs, but it doesn't work. For example, <p style="padding-left:40px;">Hello</p> should become Hello with one tab in front of it.

Here is my code so far (every 40px becomes one tab):

 private static String setNonHTML(String txt)
{
    System.out.println(txt.substring(txt.indexOf("<p style=\"padding-left:") + 23, txt.indexOf("px\"><b>")));
    //return "";
    return txt
    .replaceAll("<br>","\n")
    .replaceAll(txt.substring(txt.indexOf("<p style=\"padding-left:"), txt.indexOf("px\"><b>") + 7)
        ,"\n" + repeat("\t",Integer.parseInt(txt.substring(txt.indexOf("<p style=\"padding-left:") + 23, txt.indexOf("px\"><b>")))/40))
    .replaceAll(txt.substring(txt.indexOf("<p style=\"padding-left:"), txt.indexOf("px\">") + 4)
        ,"\n" + repeat("\t",Integer.parseInt(txt.substring(txt.indexOf("<p style=\"padding-left:") + 23, txt.indexOf("px\">")))/40))
    .replaceAll("(?s)<[^>]*>(\\s*<[^>]*>)*", "\n");
}
Theo

1 Answer

I cleaned up some of your code to show you what is happening:

    private static String setNonHTML(String txt)
    {
        System.out.println(txt.substring(txt.indexOf("<p style=\"padding-left:") + 23, txt.indexOf("px\"><b>")));
        //return "";

        //grab the padding text indexes
        int beforePaddingIndex = txt.indexOf("<p style=\"padding-left:");
        int afterPaddingIndex = txt.indexOf("px\"><b>");


        //replace all breaks with new lines
        txt = txt.replaceAll("<br>", "\n");

        //replaces every occurrence of the first padded tag found (e.g. <p style="padding-left:40px"><b>) with a newline plus one tab per 40px
        txt = txt.replaceAll(txt.substring(beforePaddingIndex, afterPaddingIndex + 7), "\n" + repeat("\t", Integer.parseInt(txt.substring(beforePaddingIndex + 23, afterPaddingIndex)) / 40));

        //the indexes of these items have changed because the last operation replaced them. The following items will not have indexes due to the replace operation.
        beforePaddingIndex = txt.indexOf("<p style=\"padding-left:");
        afterPaddingIndex = txt.indexOf("px\"><b>");
        int afterPaddingBeforeBoldIndex = txt.indexOf("px\">");

        //replace a substring of the same tag a second time? should find nothing
        txt = txt.replaceAll(txt.substring(beforePaddingIndex, afterPaddingBeforeBoldIndex + 4), "\n" + repeat("\t", Integer.parseInt(txt.substring(beforePaddingIndex + 23, afterPaddingBeforeBoldIndex)) / 40));

        txt = txt.replaceAll("(?s)<[^>]*>(\\s*<[^>]*>)*", "\n");

        return txt;
    }

As you can see, after the first replaceAll there is a second replaceAll that operates on virtually the same indexes. Your code computes the index values inline after the first replaceAll, so I recompute them at that point to replicate that behavior. Splitting code into descriptively named variables and sections is good practice and helps enormously when debugging complicated expressions. I don't know what output your program is giving you, so I can't confirm that this is the whole problem, but it does look like a bug, and I believe this gives you a good starting point.
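To make the problem concrete, here is a minimal sketch that mirrors the approach in the question: the search text and the tab count are both built from the first padded tag only. The class name, the method name `replaceUsingFirstTag`, and the local `repeat` helper are made up for illustration (the question's own `repeat` is not shown), and the sample input assumes tags of the form `px">` with no semicolon, as the question's code does.

```java
public class ReplaceAllPitfall {
    // Hypothetical stand-in for the question's repeat(...) helper.
    static String repeat(String s, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append(s);
        }
        return sb.toString();
    }

    // Builds both the search text and the replacement from the FIRST padded tag only.
    static String replaceUsingFirstTag(String txt) {
        int start = txt.indexOf("<p style=\"padding-left:");
        int end = txt.indexOf("px\">");
        // 23 is the length of <p style="padding-left:
        int px = Integer.parseInt(txt.substring(start + 23, end));
        String firstTag = txt.substring(start, end + 4);
        // replaceAll treats firstTag as a regex; it only matches tags with the same pixel value
        return txt.replaceAll(firstTag, "\n" + repeat("\t", px / 40));
    }

    public static void main(String[] args) {
        String html = "<p style=\"padding-left:40px\">A</p>"
                    + "<p style=\"padding-left:80px\">B</p>";
        System.out.println(replaceUsingFirstTag(html));
    }
}
```

Only the 40px tag is turned into a tab here. The 80px tag is left in place, so a later generic tag-stripping step would remove it without adding any indentation.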

As for what you should do to fix this, you may want to look into an off-the-shelf library such as HtmlCleaner: http://htmlcleaner.sourceforge.net/javause.php

It lets you traverse and modify HTML programmatically, read off style values such as padding-left, and then extract the content between tags.

Andrew
  • Thanks! I realized that replaceAll replaces all the substrings with the padding of the first one, so I just made a while loop that runs until beforePaddingIndex == -1. – Theo Nov 16 '17 at 01:28
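The loop described in the comment can also be written with java.util.regex, which handles each padded tag with its own pixel value. This is only a sketch under some assumptions, not the code the asker ended up with: the class name `PaddingToTabs`, the method `toPlainText`, and the `repeat` helper are invented for illustration, the pattern only recognizes `<p style="padding-left:NNpx">` with an optional semicolon, and the remaining tags are simply deleted rather than turned into newlines as in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PaddingToTabs {
    // Captures the pixel value of each padded <p> tag; the trailing semicolon is optional.
    private static final Pattern PADDED_P =
            Pattern.compile("<p style=\"padding-left:\\s*(\\d+)px;?\">");

    // Hypothetical stand-in for the question's repeat(...) helper.
    static String repeat(String s, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append(s);
        }
        return sb.toString();
    }

    static String toPlainText(String html) {
        // Literal replacement of line breaks first.
        Matcher m = PADDED_P.matcher(html.replace("<br>", "\n"));
        StringBuffer out = new StringBuffer();
        // Visit every padded tag and compute its own tab count (one tab per 40px).
        while (m.find()) {
            int tabs = Integer.parseInt(m.group(1)) / 40;
            m.appendReplacement(out, Matcher.quoteReplacement("\n" + repeat("\t", tabs)));
        }
        m.appendTail(out);
        // Simplification: drop any remaining tags instead of converting them to newlines.
        return out.toString().replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String html = "<p style=\"padding-left:40px;\">Hello</p>"
                    + "<p style=\"padding-left:80px\">World</p>";
        System.out.println(toPlainText(html));
    }
}
```

Because each match carries its own captured number, tags with different padding values no longer share the first tag's tab count. For anything beyond simple, predictable markup, an HTML parser is still the safer choice.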