I have a .docx document which looks like this:

1. Part One
   Some <message>xml</message> text
2. Part Two
   Some plain text

I want to extract the plain-text content and remove all XML tags, so I use Apache POI and Jsoup:

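import java.io.FileInputStream;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.jsoup.Jsoup;

XWPFDocument docx = new XWPFDocument(new FileInputStream(path));
XWPFWordExtractor wordxExtractor = new XWPFWordExtractor(docx);
String content = wordxExtractor.getText();    // raw text, line breaks still intact
System.out.println(content);
content = Jsoup.parse(content).text();        // strips the tags, but also normalizes whitespace
System.out.println(content);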

This prints:

Part One
Some <message>xml</message> text
Part Two
Some plain text 

and:

Part One Some xml text Part Two Some plain text

The problem: after parsing with Jsoup, the line breaks are gone. How can I strip the tags but keep the line breaks?

Munchkin
  • Hm, I don't think so: in your suggested question, the line breaks were removed because of
    <br> tags, but I don't have such tags here.
    – Munchkin Apr 21 '15 at 09:25
  • Oh wait, it's pretty similar: in the top answer, I just had to replace
    `.replaceAll("(?i)<br[^>]*>", "br2n")` with `.replaceAll("\n", "br2n")`.
    – Munchkin Apr 21 '15 at 09:29
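
For anyone landing here, the placeholder idea from the last comment might look roughly like the sketch below. It is only an illustration: the class name `LineBreakPreserver`, the method `stripKeepingLineBreaks`, and the regex-based stand-in stripper are all made up for this example so that it runs without POI or Jsoup on the classpath. It also assumes the token `br2n` never occurs in the real text.

```java
import java.util.function.UnaryOperator;

class LineBreakPreserver {
    // Placeholder token taken from the comment above; assumed not to occur in the document text.
    static final String PLACEHOLDER = "br2n";

    // Hypothetical helper: hides every "\n" behind the placeholder, runs the
    // tag stripper, then turns the placeholder back into a real line break.
    static String stripKeepingLineBreaks(String content, UnaryOperator<String> stripper) {
        String masked = content.replace("\n", PLACEHOLDER);
        String stripped = stripper.apply(masked);
        return stripped.replace(PLACEHOLDER, "\n");
    }

    public static void main(String[] args) {
        String content = "Part One\nSome <message>xml</message> text\nPart Two\nSome plain text";

        // Stand-in stripper (plain regex) so the sketch is self-contained.
        // With Jsoup available, one would pass s -> Jsoup.parse(s).text() instead.
        UnaryOperator<String> regexStripper = s -> s.replaceAll("<[^>]*>", "");

        System.out.println(stripKeepingLineBreaks(content, regexStripper));
    }
}
```

In the original code, the stripper would be `s -> Jsoup.parse(s).text()`, applied to the string returned by `wordxExtractor.getText()`. Passing the stripper in as a function is just a design choice for this sketch; it keeps the line-break handling separate from whichever library removes the tags.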
