I am trying to parse and sanitize markdown on the client and server side.

  • On the client side, I use PageDown as a markdown editor. This is exactly what StackOverflow uses, and it comes with a nifty preview box. This preview box shows you sanitized html, so it removes things like <div> tags.

  • On the server side, I'm using PegDown and JSoup to parse and sanitize the markdown.

However, I'm finding cases where the outputs of the two aren't the same. For example:

Input markdown: how are <div>tags</div> treated?

PageDown output: <p>how are tags treated?</p>

PegDown/JSoup output:

<p>how are </p>tags treated?
<p></p>

I'm not doing anything fancy with JSoup. Here's my code:

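import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;
import org.pegdown.PegDownProcessor;

public class Main {

    public static void main(String... args){

        // Convert the markdown to HTML with PegDown.
        PegDownProcessor pdp = new PegDownProcessor();
        String markdown = "how are <div>tags</div> treated?";
        String html = pdp.markdownToHtml(markdown);

        // Sanitize with JSoup: the relaxed tag set, minus <div>.
        Whitelist whitelist = Whitelist.relaxed().removeTags("div");
        html = Jsoup.clean(html, whitelist);
        System.out.println(html);

        System.out.println("Done.");
    }
}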

I understand why this is happening, and I'm not surprised that two different systems generate two different outputs. My question is: how can I set up JSoup so that it simply removes the <div> tags instead of adding extra <p> tags?

My end goal is to simply have the server-side parsing/sanitizing generate reasonably similar results to the client-side parsing/sanitizing. If there are better ways to do that, I'm open to suggestions. I don't really care if the outputs of the two are exactly identical, but things like extra <p> tags are going to be very noticeable to users, so I'm trying to eliminate this one major difference.

Bonus question: is there a list of the html tags and attributes that PageDown can output?

Edit: I've also tried using the OWASP sanitizer, but I get very similar results: the <div> tags are removed, but the <p> tags are "fixed" in the above way, which results in different html than PageDown's sanitizer.

Kevin Workman

1 Answer

how can I set up JSoup so that it simply removes the <div> tags instead of adding extra <p> tags?

The HTML5 specification does not allow a div element inside a p element. Jsoup honors the specification, which is why there are two p elements in the final HTML string.

To better understand why this happens, let's look at how Jsoup#clean works, in three steps:

  1. Parse dirty html
  2. Adjust resulting tree to honor HTML 5 specs
  3. Remove denied tags

In Step 2, the first <p> is closed just before the opening <div>, because a div cannot be nested inside a p. The closing </p> that follows the div then has no open paragraph left to match, so the parser creates a second p element for it. Since nothing sits between that implied opening tag and the closing tag, this second paragraph is empty.

Steps 1 and 2 together produce new HTML that satisfies the HTML5 specification. Only then, in Step 3, is the div removed.
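
Because the extra paragraph comes from Step 2, one possible workaround is to strip the div tags from the string before Jsoup ever parses it. The following is only a minimal sketch under that assumption: `DivStripper` and `stripDivTags` are hypothetical names, not part of Jsoup or PegDown, and the regular expression handles only simple tags (it would break on an attribute value containing `>`). It is not a sanitizer, so the result should still be passed through Jsoup.clean afterwards.

```java
// Hypothetical helper, not part of Jsoup or PegDown.
class DivStripper {

    // Removes opening and closing <div> tags (case-insensitive, with or
    // without attributes) while keeping the text between them.
    // Assumption: no attribute value contains a '>' character.
    static String stripDivTags(String html) {
        return html.replaceAll("(?i)</?div\\b[^>]*>", "");
    }

    public static void main(String[] args) {
        // The kind of HTML PegDown produces for the markdown in the question.
        String html = "<p>how are <div>tags</div> treated?</p>";
        System.out.println(stripDivTags(html));
        // prints: <p>how are tags treated?</p>
    }
}
```

With the div tags gone before parsing, the tree builder has nothing to repair, so the paragraph stays in one piece. This only addresses the div case from the question; other block-level elements nested inside a paragraph would still be rearranged in Step 2.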

My end goal is to simply have the server-side parsing/sanitizing generate reasonably similar results to the client-side parsing/sanitizing.

To avoid other cases like the one spotted here, you should use the same system on both the client and the server side. Since PageDown is written in JavaScript, you can try to run it inside a server-side JavaScript engine.

To name a few:

  • Nashorn (built into Java 8)
  • Rhino
  • V8

SAMPLE CODE

Here is a sample illustrating the use of Nashorn:

Caller.java

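import java.io.FileReader;
import javax.script.Invocable;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

public class Caller {
    public static void main(String[] args) throws Exception {
        // Load script.js into the Nashorn engine, then call its function by name.
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("nashorn");
        engine.eval(new FileReader("script.js"));

        Invocable invocable = (Invocable) engine;
        Object result = invocable.invokeFunction("myFunction", "fooValue");
        System.out.println(result);
        System.out.println(result.getClass());
    }
}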

script.js

function myFunction(foo) {
   // ...
}

Stephan
  • This doesn't appear to work for me. The output should be `<p>how are tags treated?</p>`, but if I wrap the markdown in `` tags, the output is simply `how are tags treated?` without any html tags. This is going to be even worse for real-world data, which is much longer than a single line.
    – Kevin Workman Mar 21 '16 at 12:55
  • @KevinWorkman I think that you'll find, again and again, edge cases where the two systems' outputs diverge. See my update for details... – Stephan Mar 21 '16 at 14:56
  • Yeah, I understand why it's closing the `<p>` tag. I'm looking for a way to disable that feature, or to at least rearrange the sanitization steps so that the `<div>` is removed before the tree is made valid. Funny enough, the server-side JavaScript approach is exactly what I tried first. [That doesn't work because of a known bug](http://stackoverflow.com/questions/32480370/pagedown-through-scriptengine-incorrectly-parsing-markdown) though.
    – Kevin Workman Mar 21 '16 at 16:08
  • Just got V8 working. That seems to do the trick. Now I just have to figure out which natives I need on my server. +1 for now, and if nobody else comes along then I'll mark as correct and award you the bounty. – Kevin Workman Mar 21 '16 at 18:11