How would you go about parsing Markdown?

Question

Edit: I recently learned about a project called CommonMark, which correctly identifies and deals with the ambiguities in the original Markdown specification. http://commonmark.org/ It has great C# library support.

You can find the syntax here.

The source that follows with the download is written in Perl, which I have no intentions of honoring. It is riddled with regular expressions, and it relies on MD5 hashes to escape certain characters. Something is just wrong about that!

I'm about to hard code a parser for Markdown. What is your experience with this?

If you don't have anything meaningful to say about the actual parsing of Markdown, spare me the time. (This might sound harsh, but yes, I'm looking for insight, not a solution, that is, a third-party library).

To help a bit with the answers, regular expressions are meant to identify patterns! NOT to parse an entire grammar. That people consider doing so is foobar.

If you think about Markdown, it's fundamentally based around the concept of paragraphs.
As such, a reasonable approach might be to split the input into paragraphs.
There are many kinds of paragraphs, for example, heading, text, list, blockquote, and code.
The challenge is thus to identify these paragraphs and in what context they occur.
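
As a rough illustration of that idea, here is a minimal Python sketch that splits the input on blank lines and guesses the kind of each block from how it starts. The names `split_blocks` and `classify` are hypothetical, and the rules are deliberately simplified (a code block containing a blank line would be split in two, and nesting is ignored), so treat it as a starting point rather than a conforming implementation.

```python
import re

def split_blocks(text):
    """Split the input into blocks separated by one or more blank lines."""
    normalized = text.replace("\r\n", "\n").strip("\n")
    parts = re.split(r"\n(?:[ \t]*\n)+", normalized)
    return [part for part in parts if part.strip()]

def classify(block):
    """Guess the kind of a block from its first line(s). Simplified rules."""
    lines = block.split("\n")
    first = lines[0]
    # "# Title" style heading
    if re.match(r"#{1,6}[ \t]", first):
        return "heading"
    # "Title" followed by a line of === or --- (underlined heading)
    if len(lines) == 2 and re.fullmatch(r"=+|-+", lines[1].strip()):
        return "heading"
    # every line indented by four spaces or a tab
    if all(line.startswith(("    ", "\t")) for line in lines):
        return "code"
    if first.lstrip().startswith(">"):
        return "blockquote"
    # bullet ("*", "+", "-") or numbered ("1.") list marker
    if re.match(r"[ ]{0,3}(?:[*+-]|\d+\.)[ \t]", first):
        return "list"
    return "text"
```

Calling `[classify(b) for b in split_blocks(source)]` gives one label per block; the context question (for example a list inside a blockquote) would need a second, recursive pass over the block contents.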

I'll be back with a solution, once I find it's worthy to be shared.

@cletus is writing a markdown parser, see http://www.cforcoding.com/search/label/markdown — Alex Angas, Feb 14 '10 at 02:35
I ended up doing the same. However, I'm not trying to parse markdown as if it was a formal grammar, because it's clearly not. I applied different regular expressions in a recursive manner. And in several passes. That worked out very well. — John Leidegren, Feb 14 '10 at 09:39
@JohnLeidegren, any chance other curious users such as myself can see your attempt at parsing markdown? — jmlopez, Feb 27 '13 at 01:57
@jmlopez Sorry, I don't have access to that source any longer. If you need a markdown parser, there is a NuGet package available that can be used. The idea is simple enough though: just apply a series of regular expressions in passes. Start by partitioning the input into paragraphs, then try to identify what kind of paragraph each one is, and so on. Finally, parse links and character styles within the paragraphs themselves. — John Leidegren, Feb 28 '13 at 06:43
@JohnLeidegren, I figured that much. I'm currently working on my own (I love to reinvent the wheel) using python (Although there is already one). I am having some trouble identifying the blocks though. I'll take a look at the NuGet package. thanks — jmlopez, Feb 28 '13 at 15:19
In case the discussion is still relevant: I've recently created a decent semantic markup language called Markeven as a part of [Circumflex](https://github.com/inca/circumflex) (for Scala), which resembles Markdown but has more strict rules and some unique features. Markeven has been ported recently to NodeJS under the codename [Rho](http://npmjs.org/package/rho). — BorisOkunskiy, Jul 15 '13 at 10:19
You should look at [Parsedown](http://parsedown.org). It splits text into lines. Then it looks at how these lines start and relate to each other. — Emanuil Rusev, Sep 16 '13 at 22:22
related (if you are looking for java): http://stackoverflow.com/questions/19784525/markdown-to-html-with-java-scala — Jayan, Jun 06 '16 at 15:12
If anybody needs a parser without a renderer, you can use my markdig fork: https://github.com/glebov21/markdigNoRenderer — Glebka, Apr 12 '22 at 11:55

score 73 · Accepted Answer · edited Jun 20 '20 at 09:12

The only Markdown implementation I know of that uses an actual parser is John MacFarlane's peg-markdown. Its parser is based on a Parsing Expression Grammar parser generator called peg.

EDIT: Mauricio Fernandez recently released his Simple Markup Markdown parser, which he wrote as part of his OcsiBlog Weblog Engine. Because the parser is written in OCaml, it is extremely simple and short (268 SLOC for the parser, 43 SLOC for the HTML emitter), yet blazingly fast: 20% faster than discount (written in hand-optimized C) and six hundred times faster than BlueCloth (Ruby), despite the fact that it isn't even optimized for performance yet. Because it is only intended for internal use by Mauricio himself for his weblog, there are a few deviations from the official Markdown specification, but Mauricio has created a branch which reverts most of those changes.

edited Jun 20 '20 at 09:12

Community

answered Mar 03 '09 at 10:35

Jörg W Mittag

interesting. perhaps I will try converting that as an f# project – ShuggyCoUk Feb 11 '10 at 13:51
@Benjol Same old story: no time :/ – ShuggyCoUk Mar 17 '10 at 15:11
Terence Parr (co-author of ANTLR) has written one for ANTLR 4: https://github.com/parrt/mini-markdown – Chris S Jun 27 '14 at 22:49

score 18 · Answer 2 · answered May 03 '10 at 08:16

I released a new parser-based Markdown Java implementation last week, called pegdown. pegdown uses a PEG parser to first build an abstract syntax tree, which is subsequently written out to HTML. As such it is quite clean and much easier to read, maintain and extend than a regex-based approach. The PEG grammar is based on John MacFarlane's C implementation "peg-markdown".

Maybe something of interest to you...

answered May 03 '10 at 08:16

Mathias

This is now officially deprecated – Fabich Sep 17 '18 at 15:24

Renaud Bompuis · Answer 3 · 2009-03-03T09:39:24.597

If I were to try to parse Markdown (and its extension Markdown Extra), I think I would use a state machine and parse it one character at a time, linking together internal structures that represent bits of text as I go along. Once everything is parsed, I would generate the output from those objects, all strung together.

Basically, I'd build a mini-DOM-like tree as I read the input file.
To generate output, I would just traverse the tree and emit HTML or anything else (PS, LaTeX, RTF, ...).
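
To make the tree idea concrete, here is a small Python sketch of such a mini-DOM and an HTML emitter that walks it. The `Node` class, the `TAGS` table and `to_html` are names invented for this illustration; the parsing step that would build the tree is left out entirely.

```python
from dataclasses import dataclass, field
from html import escape

@dataclass
class Node:
    kind: str                      # e.g. "document", "paragraph", "em", "text"
    text: str = ""                 # only used by "text" leaves
    children: list = field(default_factory=list)

# Hypothetical mapping from node kinds to HTML element names.
TAGS = {"paragraph": "p", "em": "em", "strong": "strong", "heading1": "h1"}

def to_html(node):
    """Depth-first traversal that renders the tree as HTML."""
    if node.kind == "text":
        return escape(node.text)   # escape <, > and & in literal text
    inner = "".join(to_html(child) for child in node.children)
    if node.kind == "document":
        return inner               # the root has no element of its own
    tag = TAGS[node.kind]
    return f"<{tag}>{inner}</{tag}>"
```

A second emitter for LaTeX or RTF would be another function over the same tree, which is the main attraction of separating parsing from output.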

Things that can increase complexity:

The fact that you can mix HTML and markdown, although the rule could be easy to implement: just ignore anything that's between two balanced tags and output it verbatim.

URLs and notes can have their reference at the bottom of the text. Using data structures for hyperlinks could simply record something like:

[my text to a link][linkkey]
results in a structure like: 
    URLStructure: 
    |  InnerText : "my text to a link"
    |  Key       : "linkkey"
    |  URL       : <null>
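
One way to fill in the missing URL, sketched in Python: collect the link usages and the definitions separately, then join them by key once the whole text has been read. `LinkRef` and `collect_links` are hypothetical names, and the two regular expressions cover only the simple cases (no titles, no nested brackets).

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class LinkRef:
    inner_text: str
    key: str
    url: Optional[str] = None      # unknown until the definition is seen

# [inner text][key]; an empty key means "use the inner text as the key"
LINK = re.compile(r"\[([^\]]+)\]\[([^\]]*)\]")
# [key]: url   (at the start of a line, indented at most three spaces)
DEFINITION = re.compile(r"^[ ]{0,3}\[([^\]]+)\]:[ \t]*(\S+)", re.MULTILINE)

def collect_links(text):
    """Find reference-style links and resolve them against their definitions."""
    links = [LinkRef(m.group(1), m.group(2) or m.group(1))
             for m in LINK.finditer(text)]
    # Keys are compared case-insensitively.
    definitions = {m.group(1).lower(): m.group(2)
                   for m in DEFINITION.finditer(text)}
    for link in links:
        link.url = definitions.get(link.key.lower())
    return links
```

Links whose key is never defined keep `url = None`, so the emitter can decide to output them as plain text.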

Headers can be defined with an underline, that could force us to use a simple data structure for a generic paragraph and modify its properties as we read the file:

ParagraphStructure:
|  InnerText    : the current paragraph text 
|                 (beginning of line until end of line).
|  HeadingLevel : <null> or 1-4 when we can assess 
|                 that paragraph heading level, if any.
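
A minimal Python sketch of that "modify as we read" approach, assuming one paragraph per line as in the structure above. `Paragraph` and `read_paragraphs` are made-up names, and only the two underline styles (`===` for level 1, `---` for level 2) are handled.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Paragraph:
    inner_text: str
    heading_level: Optional[int] = None   # None until an underline is seen

def read_paragraphs(text):
    """Read line by line; an underline promotes the previous line to a heading."""
    paragraphs = []
    previous_was_text = False
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            previous_was_text = False
            continue
        if previous_was_text and re.fullmatch(r"=+", stripped):
            paragraphs[-1].heading_level = 1
            previous_was_text = False
        elif previous_was_text and re.fullmatch(r"-+", stripped):
            paragraphs[-1].heading_level = 2
            previous_was_text = False
        else:
            paragraphs.append(Paragraph(stripped))
            previous_was_text = True
    return paragraphs
```

Note that a line of dashes only counts as an underline when it directly follows a text line; after a blank line it is kept as ordinary content (in real Markdown it would be a horizontal rule).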

Anyway, just some thoughts.

I'm sure that there are many small details to take care of, and I'm pretty sure that regexes could come in handy during the process.
After all, they were meant to process text.

score 3 · Answer 4 · edited Nov 19 '17 at 19:30

I'd probably read the syntax specification enough times to know it, and get a feel for how to parse it.

Reading the existing parser code is of course brilliant, both to see what seems to be the main source of complexity, and if any special clever tricks are being used. The use of MD5 checksumming seems a bit weird, but I haven't studied the code enough to understand why it's being done. A comment in a routine called _EscapeSpecialChars() states:

We're replacing each such character with its corresponding MD5 checksum value; this is likely overkill, but it should prevent us from colliding with the escape values by accident.

Replacing a single character by a full MD5 does seem extravagant, but perhaps it really makes sense.
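
For what it's worth, the trick is easier to judge in miniature. The Python sketch below is my own reconstruction of the idea, not a port of the Perl code: backslash-escaped characters are swapped for their MD5 hex digest before the regex passes run, and swapped back afterwards. Since a digest contains only hex digits, later patterns that look for `*` or `_` cannot match inside it. The function names and the single `emphasis` pass are assumptions made for the example.

```python
import hashlib
import re

# Characters that may be backslash-escaped in Markdown.
ESCAPABLE = "\\`*_{}[]()#+-.!"
PLACEHOLDER = {ch: hashlib.md5(ch.encode("utf-8")).hexdigest() for ch in ESCAPABLE}
ESCAPE_RE = re.compile(r"\\([" + re.escape(ESCAPABLE) + r"])")

def protect_escapes(text):
    """Replace each backslash-escaped character with its MD5 hex digest."""
    return ESCAPE_RE.sub(lambda m: PLACEHOLDER[m.group(1)], text)

def restore_escapes(text):
    """Turn the digests back into the literal characters."""
    for ch, digest in PLACEHOLDER.items():
        text = text.replace(digest, ch)
    return text

def emphasis(text):
    """A single, naive regex pass for *emphasis*."""
    return re.sub(r"\*(.+?)\*", r"<em>\1</em>", text)

def render(text):
    return restore_escapes(emphasis(protect_escapes(text)))
```

With the input `\*not emphasis\* but *this is*`, `render` leaves the escaped asterisks alone and only wraps `this is`, whereas calling `emphasis` directly wraps both. A parser that tracks escapes in its own state would not need the placeholder at all, which is the point several people make in this thread.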

Of course, it'd be clever to consider creating a "true" syntax, for a tool such as Flex to get out of the regex bog.

edited Nov 19 '17 at 19:30

rici

answered Mar 03 '09 at 07:45

unwind

That MD5 thing still bothers me, also the excessive string manipulation has to be slower than any actual decent parser you could write yourself. – John Leidegren Mar 03 '09 at 07:53
Flex is really only half the parser; once you have tokenized the input, you need to determine what the tokens mean. This is what a parser generator is for. There are lots of them. ("Parser combinator", "recursive-descent" and "LALR(1)" are key words to google for.) – jrockway Mar 03 '09 at 07:58
@jrockway: that is true of course, I guess I shrugged and thought "but if he reads up on Flex, he'll find Bison automatically". :) Thanks. – unwind Mar 04 '09 at 06:43

score 2 · Answer 5 · edited Jan 05 '13 at 21:17

MarkdownPapers is another Java implementation whose parser is defined in a JavaCC grammar.

edited Jan 05 '13 at 21:17

Peter Mortensen

answered Apr 29 '11 at 06:11

Larry Ruiz

score 2 · Answer 6 · answered Mar 03 '09 at 07:44

If Perl isn't your thing, there are Markdown implementations in at least 10 other languages. They probably don't all have 100% compatibility, but tend to be pretty close.

answered Mar 03 '09 at 07:44

Ken

score 1 · Answer 7 · answered Mar 03 '09 at 07:54

If you are using a programming language that has more than three other users, you should be able to find a library to parse it for you. A quick Google-ing reveals libraries for CL, Haskell, Python, JavaScript, Ruby, and so on. It is highly unlikely that you will need to reinvent this wheel.

If you really have to write it from scratch, I recommend writing a proper parser. With this technique, you won't have to escape things with MD5 hashes. (I agree that if you have to do something like this, it's time to reconsider your design.)

answered Mar 03 '09 at 07:54

jrockway

I'm up for the challenge. I looked at libraries but they're just awful. Ugly and stupid. I'm considering writing the parser in F# because I need a F# project but I'll probably end up doing it in C#. – John Leidegren Mar 03 '09 at 07:58
Hopefully F# has a library like Parsec; if so, this will be a fun project ;) – jrockway Mar 03 '09 at 15:12

score 0 · Answer 8 · edited Sep 05 '15 at 17:07

Here you can find a JavaScript implementation of Markdown. It also relies heavily on regular expressions, as this is just the fastest and easiest way to parse the text.

But it spares the MD5 part.

I cannot help directly with the coding of the parsing, but maybe this link can help you one way or another.

edited Sep 05 '15 at 17:07

p.campbell

answered Mar 03 '09 at 07:46

Kosi2801

score 0 · Answer 9 · answered Mar 03 '09 at 07:47

There are libraries available in a number of languages, including PHP, Ruby, Java, C#, and JavaScript. I'd suggest looking at some of these for ideas.

The best way to implement it depends on which language you wish to use; there will be idiomatic and non-idiomatic ways to do it.

Regexes work in Perl, because Perl and regex are best friends.

answered Mar 03 '09 at 07:47

garrow

Regex and Perl are best friends because somebody said so. There's no more truth to that than its historical ancestry, that it has been used like that. I have no use for something like Perl. – John Leidegren Mar 03 '09 at 07:55
Then don't use it.. Also, learn irony. – garrow Mar 03 '09 at 07:58

TFD · Answer 10 · 2009-03-03T08:37:41.920

0

Markdown is a JAWL (just another wiki language)

There are plenty of open source wikis out there whose parser code you can examine. Most use regexes.

Check out the ScrewTurn wiki; it has an interesting multi-pass formatter pipeline, a very nice technique. See /core/Formatter.cs and /core/FormatterPipeline.cs

Best is to use/join an existing project; these sorts of things are always much harder than they appear.

edited Mar 03 '09 at 08:37

answered Mar 03 '09 at 08:20

TFD

I thought that was easy until my parser totally freaked out on lines like `**hello *world***`; the ambiguity of the * is a bitch. – djfm Aug 02 '21 at 01:57
