Matching PegDown+JSoup Output to PageDown Output

Question

I am trying to parse and sanitize markdown on the client and server side.

On the client side, I use PageDown as a markdown editor. This is exactly what StackOverflow uses, and it comes with a nifty preview box. This preview box shows you sanitized html, so it removes things like <div> tags.
On the server side, I'm using PegDown and JSoup to parse and sanitize the markdown.

However, I'm finding cases where the outputs of the two aren't the same. For example:

Input markdown: how are <div>tags</div> treated?

PageDown output: <p>how are tags treated?</p>

PegDown/JSoup output:

<p>how are </p>tags treated?
<p></p>

I'm not doing anything fancy with JSoup. Here's my code:

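import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;
import org.pegdown.PegDownProcessor;

public class Main {

    public static void main(String... args) {

        // Convert the markdown to HTML with PegDown.
        PegDownProcessor pdp = new PegDownProcessor();
        String markdown = "how are <div>tags</div> treated?";
        String html = pdp.markdownToHtml(markdown);

        // Sanitize with JSoup: the relaxed whitelist minus <div>.
        Whitelist whitelist = Whitelist.relaxed().removeTags("div");
        html = Jsoup.clean(html, whitelist);
        System.out.println(html);

        System.out.println("Done.");
    }
}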

I understand why this is happening, and I'm not surprised that two different systems generate two different outputs. My question is: how can I set up JSoup so that it simply removes the <div> tags instead of adding extra <p> tags?

My end goal is to simply have the server-side parsing/sanitizing generate reasonably similar results to the client-side parsing/sanitizing. If there are better ways to do that, I'm open to suggestions. I don't really care if the outputs of the two are exactly identical, but things like extra <p> tags are going to be very noticeable by users, so I'm trying to eliminate this one major difference.

Bonus question: is there a list of the html tags and attributes that PageDown can output?

Edit: I've also tried using the OWASP sanitizer, but I get very similar results: the <div> tags are removed, but the <p> tags are "fixed" in the above way, which results in different html than PageDown's sanitizer.

score 2 · Accepted Answer · edited May 23 '17 at 12:22

how can I set up JSoup so that it simply removes the <div> tags instead of adding extra <p> tags?

The HTML5 specification does not allow a div element inside a p element. Jsoup honors the specification, which is why there are two p elements in the final HTML string.

To better understand why this happens, let's see how Jsoup#clean works in three steps:

1. Parse the dirty HTML
2. Adjust the resulting tree to honor the HTML5 specification
3. Remove the denied tags

In Step 2, the first <p> element is closed just before the opening <div>. The closing </p> left over at the end of the input gets a matching opening tag in this same step, which creates the second p. Since Jsoup cannot know where the legitimate content of this second paragraph starts, it keeps the content to the strict minimum (i.e. nothing).

Steps 1 and 2 together produce new HTML that satisfies the HTML5 specification. In Step 3, the div tags can then be removed.
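One possible workaround, which is not a Jsoup feature, is to strip the div tags from PegDown's output before it reaches Jsoup, so the parser never sees a div inside a p. The following is only a minimal sketch, assuming PegDown emits <p>how are <div>tags</div> treated?</p> for the sample input. The DivUnwrapper class and the unwrapDivs method are hypothetical names, and the regex only covers plain opening and closing div tags, not every HTML edge case:

```java
import java.util.regex.Pattern;

public class DivUnwrapper {

    // Matches an opening <div ...> or a closing </div> tag, case-insensitively.
    // Hypothetical helper, not part of PegDown or Jsoup.
    private static final Pattern DIV_TAG =
            Pattern.compile("</?div(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);

    // Removes the div tags themselves but keeps the text between them.
    public static String unwrapDivs(String html) {
        return DIV_TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Assumed PegDown output for the markdown in the question.
        String html = "<p>how are <div>tags</div> treated?</p>";
        System.out.println(unwrapDivs(html));
    }
}
```

The result would still be passed to Jsoup.clean as in the question, so sanitizing stays Jsoup's job. With no div left inside the p, the parser has nothing to repair in this case.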

My end goal is to simply have the server-side parsing/sanitizing generate reasonably similar results to the client-side parsing/sanitizing.

To avoid other cases like the one spotted here, you should use the same system on both the client and the server side. Since PageDown is written in JavaScript, you can try to run it inside a server-side JavaScript engine.

To name a few:

Nashorn (built into Java 8)
Rhino
V8

SAMPLE CODE

Here is a sample illustrating the use of Nashorn:

Caller.java

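import java.io.FileReader;
import javax.script.Invocable;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

// Load script.js into a Nashorn engine.
ScriptEngine engine = new ScriptEngineManager().getEngineByName("nashorn");
engine.eval(new FileReader("script.js"));

// Call a function defined in script.js from Java.
Invocable invocable = (Invocable) engine;
Object result = invocable.invokeFunction("myFunction", "fooValue");

// Assumes myFunction returns a value; a null result would make getClass() fail.
System.out.println(result);
System.out.println(result.getClass());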

script.js

function myFunction(foo) {
   // ...
}
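
Note that getEngineByName returns null when no engine is registered under the requested name, and Nashorn was removed from the JDK in Java 15. As a hedged sketch, the lookup could check for that before evaluating anything. The EngineLookup class and the findJavaScriptEngine method are hypothetical names, and the list of engine names tried here is an assumption about what is installed:

```java
import java.util.Optional;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

public class EngineLookup {

    // Tries a few engine names in order and returns the first one registered.
    // Which names resolve depends on the JDK version and the classpath.
    public static Optional<ScriptEngine> findJavaScriptEngine() {
        ScriptEngineManager manager = new ScriptEngineManager();
        for (String name : new String[] {"nashorn", "JavaScript"}) {
            ScriptEngine engine = manager.getEngineByName(name);
            if (engine != null) {
                return Optional.of(engine);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<ScriptEngine> engine = findJavaScriptEngine();
        if (engine.isPresent()) {
            System.out.println("Found engine: " + engine.get().getFactory().getEngineName());
        } else {
            System.out.println("No JavaScript engine is registered on this JVM.");
        }
    }
}
```

On a JVM without a bundled engine, a separate JavaScript engine would have to be added to the classpath before the Caller.java sample can run.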
