I have a .docx document which looks like this:
1. Part One
Some <message>xml</message> text
2. Part Two
Some plain text
I want to extract the plain-text content and remove all XML tags, so I use Apache POI and Jsoup:
// Extract the document text with Apache POI
XWPFDocument docx = new XWPFDocument(new FileInputStream(path));
XWPFWordExtractor wordExtractor = new XWPFWordExtractor(docx);
String content = wordExtractor.getText();
System.out.println(content);
// Strip the XML tags with Jsoup
content = Jsoup.parse(content).text();
System.out.println(content);
This prints:
Part One
Some <message>xml</message> text
Part Two
Some plain text
and:
Part One Some xml text Part Two Some plain text
The problem is that after parsing with Jsoup, the line breaks are gone. How can I keep them?
... <br>-tags, but here I don't have such tags. – Munchkin Apr 21 '15 at 09:25
... `.replaceAll("(?i)<br[^>]*>", "br2n")` with `.replaceAll("\n", "br2n")` – Munchkin Apr 21 '15 at 09:29
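The placeholder idea from the last comment could look roughly like the sketch below: swap each line break for a marker before stripping the tags, then turn the marker back into "\n" afterwards. The class name `KeepLineBreaks`, the helper `stripKeepingLineBreaks` and the `standInStripper` method are hypothetical names made up for this illustration. The stand-in is a crude regex that only imitates Jsoup's whitespace collapsing so that the sketch runs without any dependency; with Jsoup on the classpath you would presumably pass `s -> Jsoup.parse(s).text()` instead.

```java
import java.util.function.UnaryOperator;

class KeepLineBreaks {

    // Marker taken from the comment above; assumed not to occur in the real text.
    static final String PLACEHOLDER = "br2n";

    // Hypothetical helper: hide the line breaks from the tag stripper,
    // then restore them once the tags are gone.
    static String stripKeepingLineBreaks(String text, UnaryOperator<String> tagStripper) {
        String marked = text.replaceAll("\\R", PLACEHOLDER); // \R matches any line break
        String stripped = tagStripper.apply(marked);
        return stripped.replace(PLACEHOLDER, "\n");
    }

    // Stand-in for Jsoup.parse(s).text(): drops tags and collapses whitespace.
    // Not a real HTML/XML parser, for demonstration only.
    static String standInStripper(String s) {
        return s.replaceAll("<[^>]*>", "").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String content = "Part One\nSome <message>xml</message> text\n"
                + "Part Two\nSome plain text";
        // With Jsoup available: stripKeepingLineBreaks(content, s -> Jsoup.parse(s).text())
        System.out.println(stripKeepingLineBreaks(content, KeepLineBreaks::standInStripper));
    }
}
```

One caveat of this design: if the document itself contains the literal text "br2n", it would be turned into a line break, so a less likely marker may be the safer choice.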