31

I want to parse and then traverse a Markdown file. I'm looking for something like xml.etree.ElementTree but for Markdown.

One option would be to convert to HTML and then use another library to parse the HTML. But I'd like to avoid that step.

Thanks.

jpemberthy

2 Answers

25

As another comment mentioned, Python-Markdown has an extension API, and it happens to use xml.etree.ElementTree under the hood. You could theoretically create an extension that accesses that internal ElementTree object and does what you want with it. However, if you use raw HTML (including HTML entities) and/or the codehilite extension, you will get an incomplete document, as there are a few postprocessors that run on the serialized string. So I wouldn't really recommend it for your intended purpose (full disclosure: I'm the developer of Python-Markdown).
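For illustration only, here is a minimal sketch of such an extension, assuming Python-Markdown 3.x. The names `TreeCapture`, `TreeCaptureExtension`, `tree_capture`, and `tags` are hypothetical, and the priority of 5 is an assumption chosen so the processor runs after the built-in inline processing. The caveat above still applies: raw HTML and codehilite output are not in the tree at this point.

```python
import markdown
from markdown.extensions import Extension
from markdown.treeprocessors import Treeprocessor


class TreeCapture(Treeprocessor):
    """Hypothetical treeprocessor that records every tag in the tree."""

    def __init__(self, md, sink):
        super().__init__(md)
        self.sink = sink

    def run(self, root):
        # root is an xml.etree.ElementTree.Element wrapping the document.
        for element in root.iter():
            self.sink.append(element.tag)
        return None  # None means "keep using the same root"


class TreeCaptureExtension(Extension):
    """Hypothetical extension that registers TreeCapture."""

    def __init__(self, sink, **kwargs):
        self.sink = sink
        super().__init__(**kwargs)

    def extendMarkdown(self, md):
        # A low priority runs late, after inline markup has been parsed.
        md.treeprocessors.register(TreeCapture(md, self.sink), "tree_capture", 5)


tags = []
markdown.markdown("# Title\n\nSome *text*.", extensions=[TreeCaptureExtension(tags)])
print(tags)
```

The list `tags` is filled as a side effect of the normal conversion, so the HTML string is still produced and simply discarded here.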

A rather lengthy list of Markdown implementations exists here. Of the pure Python implementations in that list, Mistune is the only one that I am aware of that uses a two-step process (step one returns a parse tree, step two serializes the parse tree; you only need step one). I have never used Mistune personally and cannot speak to its stability or accuracy, but it is supposed to be a Python clone of the very good JavaScript library Marked.

*** Edit ***

A few newer Python packages have become available which all use the parser/renderer pattern and/or parse tree/token stream to varying degrees. I don't have any personal experience with any of them, but they may be useful for this purpose. See mistletoe, markdown-it-py, and marko.

*** End Edit ***

If you search around, I believe that a few of the C implementations use a similar pattern. Some of them might even already have a Python wrapper. If not, it shouldn't be too difficult to create a wrapper with ctypes.

If for some reason you want to use an implementation that does not give you a full parse tree, then I would suggest parsing the resulting HTML using lxml (a Python wrapper of the C lib) or html5lib (pure Python), both of which can return an ElementTree object and are much faster (especially lxml) and more forgiving of invalid HTML (especially html5lib, which acts more like real browsers in the real world). Remember that Markdown can contain raw HTML, and most Markdown parsers simply pass it through, valid or not. If you then try to parse it with an XML-based parser (like in xml.etree) or a strict HTML parser (like html.parser in the standard lib), a single invalid tag can crash the parser.
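To make the XML half of that last point concrete, here is a small standard-library sketch. The helper `try_parse` and both sample fragments are made up for illustration; the second fragment contains an unclosed `<br>`, which is ordinary HTML but not well-formed XML.

```python
import xml.etree.ElementTree as ET


def try_parse(fragment):
    """Return the parsed root element, or None if the XML parser rejects it."""
    try:
        return ET.fromstring(fragment)
    except ET.ParseError:
        return None


well_formed = "<div><p>Hello <em>world</em></p></div>"
sloppy = "<div><p>Hello<br>world</p></div>"  # <br> is never closed

print(try_parse(well_formed) is not None)
print(try_parse(sloppy) is None)
```

A forgiving parser such as `lxml.html.fromstring` or `html5lib.parse` is designed to accept input like the second fragment instead of raising.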

Waylan
  • @Waylan - how can one access the Internal ElementTree in Python-Markdown? Thank you! – jim70 Oct 11 '20 at 03:32
  • @jim70 you can only do that from an extension. Specifically from a [treeprocessor](https://python-markdown.github.io/extensions/api/#treeprocessors). – Waylan Oct 12 '20 at 18:24
  • Thank you for your reply! Is there any example I can try and study and learn of somebody using the Markdown ElementTree to extract portions of the element tree to add to another Markdown file? I tried but failed to solve it for myself. :( – jim70 Oct 13 '20 at 19:45
  • @jim70 the Extension API is not intended to be used that way. Rather it is intended to alter the document. I expect you would have more success using a parser which generated a token stream or syntax tree. See the newer libraries I linked to in an edit to my answer above. – Waylan Oct 15 '20 at 13:18
  • Thank you for the edits that you added to your answer. Will try and go through those packages to see if I can decipher how to extract the token stream / syntax tree from how those libraries work. @Waylan – jim70 Oct 16 '20 at 14:36
  • Hello from 2023 - `mistune` definitely provides an AST like product via https://mistune.lepture.com/en/latest/guide.html#ast `mistune.create_markdown(renderer='ast')` – David Mar 27 '23 at 20:51
5

There are Markdown parsing modules, but unlike XML and HTML processing modules, they tend to be embedded within Markdown rendering packages, rather than presented for arbitrary Markdown parsing work.

So option one would be to look into Markdown processors in Python, of which there are a ton, find the parser you like most, and adopt that.

Depending on what you want to accomplish, however, it might be easier to find a Markdown processing module that's already extensible, and build a processing extension. Python-Markdown, e.g., has a complete extension mechanism.

joelhed
Jonathan Eunice