I am reading text from a URL using Jsoup. The following question has some tips for preserving new lines when converting the body to text: "How do I preserve line breaks when using jsoup to convert html to plain text?"

I use the following lines to convert the tags:

    String prettyPrintedBodyFragment = Jsoup.clean(body, "",
            Whitelist.none().addTags("br", "p", "h1"),
            new OutputSettings().prettyPrint(true));
    System.out.println(prettyPrintedBodyFragment);

I still get the body content on a single line. Any clues, please?

EDIT: Here is the complete source code. The output still comes out on a single line.

    public static void main(String[] args) throws Exception {
        Connection conn = Jsoup.connect("http://finance.yahoo.com/");
        Document doc = conn.get();

        String body = doc.body().text();

        String prettyPrintedBodyFragment = Jsoup.clean(body, "",
                Whitelist.none().addTags("br", "p", "h1"),
                new OutputSettings().prettyPrint(true));

        System.out.println(prettyPrintedBodyFragment);
    }
kashili

1 Answer

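Change:

    String body = doc.body().text();

To:

    String body = doc.body().html();

text() has already stripped every tag from the body, so by the time the string reaches Jsoup.clean there are no br, p, or h1 elements left for your Whitelist to keep, and the pretty-printer has nothing to break lines on. Passing html() instead hands the cleaner the markup, so the whitelisted tags survive and can be used for formatting.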
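
To see why the order matters without pulling in jsoup, here is a small stand-alone sketch. It is not how jsoup works internally: it uses crude regular expressions as a stand-in for a real parser, and the class name LineBreakDemo and both helper methods are made up for illustration. It only shows that once the tags are gone, nothing is left to tell you where the line breaks belong.

```java
import java.util.regex.Pattern;

public class LineBreakDemo {

    // Matches any tag. A crude stand-in for a real HTML parser (assumption:
    // the input is simple, well-formed markup; do not use this on real pages).
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    // Matches <br>, <br/>, and the closing tags of p and h1.
    private static final Pattern BREAK_TAG =
            Pattern.compile("(?i)<br\\s*/?>|</p>|</h1>");

    // Mirrors the question: every tag is removed before any formatting step runs.
    public static String stripTagsFirst(String html) {
        return ANY_TAG.matcher(html).replaceAll("");
    }

    // Mirrors the answer: the markup is still present when line breaks are decided.
    public static String breakThenStrip(String html) {
        String withBreaks = BREAK_TAG.matcher(html).replaceAll("\n");
        return ANY_TAG.matcher(withBreaks).replaceAll("").trim();
    }

    public static void main(String[] args) {
        String html = "<h1>Title</h1><p>First line<br>Second line</p>";

        // Everything ends up on one line, as in the question.
        System.out.println(stripTagsFirst(html));
        System.out.println("---");
        // Three separate lines, because the tags were still there to act on.
        System.out.println(breakThenStrip(html));
    }
}
```

With jsoup itself the same principle applies: keep the HTML (html()) until the step that needs the tags has run, and only then reduce it to text.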

StoopidDonut