I am trying to parse and sanitize markdown on the client and server side.
On the client side, I use PageDown as a markdown editor. This is exactly what StackOverflow uses, and it comes with a nifty preview box. This preview box shows you sanitized html, so it removes things like `<div>` tags.
On the server side, I'm using PegDown and JSoup to parse and sanitize the markdown.
However, I'm finding cases where the two don't produce the same output. For example:
Input markdown: how are <div>tags</div> treated?
PageDown output: <p>how are tags treated?</p>
PegDown/JSoup output:
<p>how are </p>tags treated?
<p></p>
I'm not doing anything fancy with JSoup. Here's my code:
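import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;
import org.pegdown.PegDownProcessor;

public class Main {
    public static void main(String... args) {
        PegDownProcessor pdp = new PegDownProcessor();
        String markdown = "how are <div>tags</div> treated?";
        // Step 1: convert the markdown to html
        String html = pdp.markdownToHtml(markdown);
        // Step 2: sanitize with the relaxed whitelist, minus <div>
        Whitelist whitelist = Whitelist.relaxed().removeTags("div");
        html = Jsoup.clean(html, whitelist);
        System.out.println(html);
        System.out.println("Done.");
    }
}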
I understand why this is happening, and I'm not surprised that two different systems generate two different outputs. My question is: how can I set up JSoup so that it simply removes the `<div>` tags instead of adding extra `<p>` tags?
My end goal is simply to have the server-side parsing/sanitizing generate reasonably similar results to the client-side parsing/sanitizing. If there are better ways to do that, I'm open to suggestions. I don't really care if the outputs of the two are exactly identical, but things like extra `<p>` tags are going to be very noticeable to users, so I'm trying to eliminate this one major difference.
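One rearrangement that might sidestep the extra `<p>` tags is to strip the `<div>` tags from the raw markdown before it ever reaches PegDown, so the html parser never sees a block element nested inside a paragraph. The following is only a minimal sketch under that assumption: the class `DivStripSketch` and the helper `stripDivTags` are hypothetical names, it uses nothing but the standard library, and the regex is not a sanitizer in its own right (it would also strip `<div>` inside markdown code spans), so the whitelist step would still have to run afterwards.

```java
import java.util.regex.Pattern;

public class DivStripSketch {
    // Matches opening and closing <div> tags, with optional attributes, case-insensitively.
    private static final Pattern DIV_TAG =
            Pattern.compile("</?div\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    /** Hypothetical helper: removes <div> tags from raw markdown but keeps their contents. */
    static String stripDivTags(String markdown) {
        return DIV_TAG.matcher(markdown).replaceAll("");
    }

    public static void main(String[] args) {
        String markdown = "how are <div>tags</div> treated?";
        // The stripped markdown would then be passed to the markdown processor
        // and the whitelist sanitizer exactly as in the code above.
        System.out.println(stripDivTags(markdown));
    }
}
```

With the sample input, the string handed to the markdown step no longer contains a `<div>`, so the later sanitizing pass has nothing to "fix" inside the paragraph.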
Bonus question: is there a list of the html tags and attributes that PageDown can output?
Edit: I've also tried using the OWASP sanitizer, but I get very similar results: the `<div>` tags are removed, but the `<p>` tags are "fixed" in the above way, which results in different html than PageDown's sanitizer.
… `how are tags treated?`, but if I wrap the markdown in … tags, the output is simply `how are tags treated?` without any html tags. This is going to be even worse for real-world data, which is much longer than a single line. – Kevin Workman Mar 21 '16 at 12:55
… tag. I'm looking for a way to disable that feature, or to at least rearrange the sanitization steps so that the …