
I've been looking into creating a markup language similar to Markdown, and I'm wondering where to start. I've done some research on creating languages, but the tutorials I found all talk about lexers and ASTs, and in the end the language is handed off to something like LLVM.

From what I understand, languages like C are imperative, while languages like Markdown are declarative. What exactly does the toolchain look like for a language that probably isn't going to touch anything like LLVM?

I've seen other answers, such as how to tokenize a language in Python. How might I do this in C, though? I'd like to have something that can be used anywhere (e.g. integrated into a Ruby native extension, or in a C# project).

I can't seem to find a good direction to take with this. Does anybody have experience or tips on where to start? And at what point, and where, would I build the "binary" (that is, generate HTML from the source text)?

Does Markdown even use a lexer? From the syntax, it looks like it could very well just use regular expressions.

Apologies if this is too broad, but I can't find very much info on the topic (perhaps I'm just looking in the wrong places!)

Alexander Lozada
  • Any good parser (Markdown or not) should **never** use regex to do parsing. Regex is for *regular* expressions. Programming/Markup languages are by their very nature irregular. – Sinkingpoint Aug 26 '16 at 00:39
  • Mostly for my own benefit and practice. – Alexander Lozada Aug 26 '16 at 00:48
  • In fact regex is enough for parsing Markdown, as it mostly relies on formatting symbols, rather than keywords. There is not too much to parse/analyze but some recognizable symbols. Most Markdown parsers use regex. Here you can find an explanation about how to do it: https://github.com/Khan/simple-markdown – Jaime Aug 26 '16 at 00:58
  • Thank you very much for that! I guess perhaps I was looking at this the wrong way. – Alexander Lozada Aug 26 '16 at 00:59
  • Imperative and declarative are kinds of programming languages; Markdown is not a programming language. – user253751 Aug 26 '16 at 02:20
  • Declarative languages don't have to be *programming* languages (at least from what I read.) e.g., YAML or XML is declarative (or at least a [subparadigm of declarative langs](https://en.wikipedia.org/wiki/Declarative_programming)). – Alexander Lozada Aug 26 '16 at 03:02

1 Answer


You are right: simple markup languages like Markdown are declarative. Very simple implementations exist that involve no lexer or AST at all.

The original Markdown implementation, for example, was a simple Perl script using regular expressions. It was written by John Gruber (the creator of Markdown) and is available here: http://daringfireball.net/projects/downloads/Markdown_1.0.1.zip
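To give a feel for the regex-substitution style in C, here is a minimal sketch. It is not taken from Markdown.pl, which does far more; it only rewrites `*text*` spans to `<em>text</em>`. It assumes a POSIX system that provides `<regex.h>`, and the function name `emphasize` is made up for illustration.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy src to out, replacing every *text* span with
 * <em>text</em>. Returns 0 on success, -1 if the pattern does not compile
 * or out (cap bytes, including the terminating NUL) is too small. */
int emphasize(const char *src, char *out, size_t cap)
{
    regex_t re;
    regmatch_t m[2];   /* m[0]: whole match, m[1]: text between the asterisks */
    size_t used = 0;
    int rc = 0;

    assert(src != NULL && out != NULL && cap > 0);
    if (regcomp(&re, "\\*([^*]+)\\*", REG_EXTENDED) != 0)
        return -1;

    while (regexec(&re, src, 2, m, 0) == 0) {
        /* Emit the text before the match, then the wrapped capture group. */
        int n = snprintf(out + used, cap - used, "%.*s<em>%.*s</em>",
                         (int)m[0].rm_so, src,
                         (int)(m[1].rm_eo - m[1].rm_so), src + m[1].rm_so);
        if (n < 0 || (size_t)n >= cap - used) { rc = -1; break; }
        used += (size_t)n;
        src += m[0].rm_eo;   /* continue scanning after this match */
    }
    if (rc == 0) {
        /* Copy whatever follows the last match. */
        int n = snprintf(out + used, cap - used, "%s", src);
        if (n < 0 || (size_t)n >= cap - used) rc = -1;
    }
    regfree(&re);
    return rc;
}
```

A real converter would chain many such passes (and would need to deal with `**strong**`, escaping, and nesting, none of which this sketch attempts).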

There is also a C implementation you can have a look at, called Discount, available here: http://www.pell.portland.or.us/~orc/Code/discount/

Both these tools are completely open-source and show you exactly what is necessary to process a markup language. They include the whole toolchain, including the parser.
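As a rough illustration of how small the "toolchain" can be, here is a sketch of a line-by-line converter in plain C with no lexer, no AST, and no regex. It is not how Discount or Markdown.pl work internally; it handles only a tiny made-up subset (`#` headings and one-line paragraphs), does no HTML escaping, and the names `tiny_markup_to_html` and `append` are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: append n bytes of s to the NUL-terminated string in
 * out, silently truncating so that out never exceeds cap bytes. */
static void append(char *out, size_t cap, const char *s, size_t n)
{
    size_t used = strlen(out);
    if (used + 1 >= cap) return;                 /* buffer already full */
    if (n > cap - 1 - used) n = cap - 1 - used;  /* truncate to fit */
    memcpy(out + used, s, n);
    out[used + n] = '\0';
}

/* Convert a tiny Markdown-like subset to HTML:
 *   "# text" .. "###### text"  ->  <h1>text</h1> .. <h6>text</h6>
 *   any other non-empty line   ->  <p>text</p>
 * Empty lines are skipped. */
void tiny_markup_to_html(const char *src, char *out, size_t cap)
{
    assert(src != NULL && out != NULL && cap > 0);
    out[0] = '\0';

    while (*src != '\0') {
        /* Slice off the current line and advance past its newline. */
        const char *line = src;
        size_t len = strcspn(src, "\n");
        src += len;
        if (*src == '\n') src++;

        /* Count leading '#' characters to detect a heading. */
        size_t level = 0;
        while (level < len && line[level] == '#') level++;

        char tag[3] = "p";
        if (level >= 1 && level <= 6 && level < len && line[level] == ' ') {
            tag[0] = 'h';
            tag[1] = (char)('0' + level);
            line += level + 1;                   /* skip "#... " prefix */
            len  -= level + 1;
        }
        if (len == 0) continue;                  /* nothing to emit */

        append(out, cap, "<", 1);
        append(out, cap, tag, strlen(tag));
        append(out, cap, ">", 1);
        append(out, cap, line, len);
        append(out, cap, "</", 2);
        append(out, cap, tag, strlen(tag));
        append(out, cap, ">\n", 2);
    }
}
```

Because the entry point is an ordinary C function that reads a string and writes into a caller-supplied buffer, it is the kind of interface that can typically be wrapped by a Ruby native extension or called from C# through its native interop. The "build" step is simply calling the function: HTML comes out directly, with no intermediate binary.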

Yann Bodson
  • I appreciate the link to discount. However, it's more of an end result. I'm interested in *how* to get there, and what tools are involved. – Alexander Lozada Aug 26 '16 at 00:49
  • Markup languages are so simple, there are no other tools involved. Generally just regular expressions as you can see in the code I linked to... – Yann Bodson Aug 26 '16 at 02:04