Extracting the first formatted line from some RTF/HTML text

Question

OK, I painted myself into a corner on this one and haven't decided the way out yet.

My web application hosts a series of documents written by users, and edited with the CLEditor editor via PrimeFaces. The documents can be any size and have any formatting the user chooses.

What I want to do is treat the first line of the document as a title, so that when I create a listing of those documents I show only the title, then the user can click on that table row to see the whole document. I show the title with

<h:outputText value="#{backBean.doc}" escape="false" />

What I did is pull out the substring of the document up to, but not including, the first occurrence of the br tag. That works unless the user applies formatting that spans past that point. The resulting string then has unclosed HTML tags (usually div or span), and when they are output without escaping they interfere with, or even blank out, the rest of the page.
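Roughly, the truncation amounts to the sketch below. The class name, method name and sample document are invented here for illustration, so the real code may differ in detail; the point is only to show how the cut leaves tags open.

```java
class NaiveTitle {
    // Cuts the document just before the first "<br", as described above.
    // Hypothetical helper: the name and signature are assumptions.
    static String firstLine(String doc) {
        int i = doc.indexOf("<br");
        return i < 0 ? doc : doc.substring(0, i);
    }

    public static void main(String[] args) {
        // Formatting that starts on the first line and ends after the break.
        String doc = "<div><span style=\"color:red\">My title<br>Body text</span></div>";
        // The div and span opened here are never closed in the result.
        System.out.println(firstLine(doc));
    }
}
```

With this sample input the fragment keeps the opening div and span but loses their closing tags, which is what breaks the rest of the page.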

So I am looking for an easy solution to fix the HTML fragment. I would rather not import a huge library such as JTidy because it pulls in all sorts of dependencies I don't have right now like a DOM parser, etc. Can anyone suggest a cheaper yet robust solution? Is there any way to clean this up on the client side?

score 2 · Accepted Answer · edited May 23 '17 at 12:20

I'd suggest Jsoup.

To parse the HTML and get its <body> content, it's a matter of this one-liner:

String htmlBody = Jsoup.parse(userInput).body().html();

By the way, since you seem to intend to redisplay user-controlled HTML unescaped, I strongly recommend sanitizing it against a whitelist to prevent XSS. For example (recent jsoup releases name this class Safelist instead of Whitelist):

String safeHtmlBody = Jsoup.clean(htmlBody, Whitelist.basic());

This way you can safely redisplay it without worrying about an XSS hole:

<h:outputText value="#{bean.safeHtmlBody}" escape="false" />

Wow, that does look easy. With the solution I posted above I was going to implement my own whitelist capability, but if it has been done then I'm set. But my follow-up question still stands -- is ANY site that uses CLEditor (via Primefaces or otherwise) then re-displays the user input text without processing it basically open to the XSS attack? – AlanObject Apr 02 '12 at 16:06
@AlanObject: if it's developed by someone unaware of the risk, yes. This is not specifically related to JSF, PrimeFaces, CLEditor or whatever. Redisplaying user-controlled input unescaped/unsanitized is in ANY case a security hole, regardless of the involved programming language, frameworks, etc. – BalusC Apr 02 '12 at 16:14
For anyone else reading this thread, note that the main difference between JTidy and Jsoup that I have found is that JTidy will attempt to close tags -- Jsoup won't. Nonetheless Jsoup is what I will use. – AlanObject Apr 02 '12 at 19:38

score 1 · Answer 2 · answered Apr 01 '12 at 15:28

You should be escaping the partial contents of the document somehow, otherwise users can upload documents containing HTML/JavaScript code that will compromise your site. As you can see, even simple formatting can break it. One solution could be to remove all tags (via regex, string replace, etc) and then escape the title.
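A minimal sketch of that strip-then-escape idea, using only the standard library. The class and method names are made up for this example, the tag regex is a heuristic rather than a real HTML parser, and entities already present in the source (such as `&amp;`) are not decoded first, so they would end up double-escaped. The title also loses all of its formatting.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlainTitle {
    // First <br>, <br/> or <br /> tag, in any letter case.
    private static final Pattern FIRST_BREAK =
            Pattern.compile("<br\\s*/?>", Pattern.CASE_INSENSITIVE);
    // Anything shaped like a tag; a heuristic, not a real HTML parser.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");

    /** Returns the text before the first line break, with tags removed and escaped. */
    static String extract(String html) {
        Matcher m = FIRST_BREAK.matcher(html);
        String firstLine = m.find() ? html.substring(0, m.start()) : html;
        String text = ANY_TAG.matcher(firstLine).replaceAll("").trim();
        return escape(text);
    }

    /** Escapes the five characters that are significant in HTML text and attributes. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                case '\'': sb.append("&#39;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(extract(
                "<div><span style=\"color:red\">My <b>Title</b><br>Body</span></div>"));
    }
}
```

Because the result is already escaped, it can be output with escape="false"; alternatively, skip the escape step and let h:outputText do the escaping with its default settings.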

Maciej


From your description, there is no way to use the CLEditor in a web page, then re-render the user's formatted text for other users, without opening up an injection security hole. Is this the case with the thousands of sites that use CLEditor? – AlanObject Apr 01 '12 at 16:06
The problem has more to do with the rerendering of unsanitized input. As long as you somehow escape the input before rerendering parts of it you will prevent xss and injection attacks. – Maciej Apr 01 '12 at 17:58

score 0 · Answer 3 · answered Apr 01 '12 at 20:13

I figured out the JTidy way of doing it. This seems very heavy-handed to me, but I'm going with it until something better is suggested. It might also be useful to someone else in this situation:

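import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.NodeList;

public class TitleRTF {

    // DOTALL so that body content spanning several lines still matches.
    private static final Pattern pTidy = Pattern.compile("<body>(.*)</body>", Pattern.DOTALL);

    public TitleRTF() {}

    public static String getTitle(String rtfSource) {

        org.w3c.tidy.Tidy tidy = new org.w3c.tidy.Tidy();
        tidy.setQuiet(true);

        // Let JTidy parse the (possibly broken) fragment into a DOM.
        ByteArrayInputStream bais = new ByteArrayInputStream(rtfSource.getBytes());
        org.w3c.dom.Document doc = tidy.parseDOM(new BufferedInputStream(bais), null);
        try {
            Transformer tr = TransformerFactory.newInstance().newTransformer();
            StreamResult result = new StreamResult(new StringWriter());
            NodeList list = doc.getElementsByTagName("body");
            if (list.getLength() > 0) {
                // Serialize the repaired <body> element back to a string.
                DOMSource source = new DOMSource(list.item(0));
                tr.transform(source, result);
                String text = result.getWriter().toString();
                // Keep only what sits between <body> and </body>.
                Matcher m = pTidy.matcher(text);
                if (m.find()) return m.group(1);
            }
        } catch (TransformerException ex) {
            // Fall through to the placeholder below.
        }
        return "(not parsable)";
    }
}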

One thing that needs to be added to this is a way of keeping JTidy from logging what it sees as HTML errors. The setQuiet(true) doesn't seem to do it.
