How might I get a list of elements from a Markdown file in Python 3? I'm specifically interested in getting a list of all images and links (along with relevant information like alt text and link text) out of a Markdown file.

There is some prior art in this area, but it is almost exactly 2 years old at this point, and I expect that the landscape has changed a bit.

Bonus points if the parser you come up with supports multimarkdown.

Andrew Spott
  • @coralv : I've looked into regex to extract the links, but run into the problem that I really need a push down automaton for that, in order to account for arbitrary nested brackets. Mostly I'm looking for a library solution before I build a parser. – Andrew Spott Dec 03 '16 at 08:08
  • Markdown itself hasn't changed in over a decade, so I'd say the linked question and answers are pretty up-to-date. – Waylan Dec 04 '16 at 00:44
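
To illustrate the nested-bracket point from the comment above: a single regular expression cannot track arbitrarily deep brackets, but a depth counter can. The following is only a minimal sketch, not a Markdown parser. The function names are made up for illustration, and it deliberately ignores images, escapes, code spans, reference-style links, and parentheses inside URLs.

```python
def matching_bracket(text, start):
    """Return the index of the ']' closing the '[' at text[start], or -1."""
    depth = 0
    for pos in range(start, len(text)):
        if text[pos] == "[":
            depth += 1
        elif text[pos] == "]":
            depth -= 1
            if depth == 0:
                return pos
    return -1  # brackets never balanced


def find_inline_links(text):
    """Collect (label, target) pairs for inline links of the form [label](target)."""
    links = []
    pos = 0
    while pos < len(text):
        if text[pos] == "[":
            close = matching_bracket(text, pos)
            # An inline link needs '(' directly after the closing bracket.
            if close != -1 and text[close + 1:close + 2] == "(":
                end = text.find(")", close + 2)
                if end != -1:
                    links.append((text[pos + 1:close], text[close + 2:end]))
                    pos = end + 1
                    continue
        pos += 1
    return links
```

Even this small case needs explicit state (the depth counter), which is the reason a library parser is the safer choice for real documents.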

2 Answers

If you take advantage of two Python packages, pypandoc and panflute, you could do it quite pythonically in a few lines (sample code):

Given a text file example.md, and assuming you have Python 3.3+ and have already run pip install pypandoc panflute (pypandoc also needs the pandoc executable itself to be installed), place the sample code in the same folder and run it from the shell or from e.g. IDLE.

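import io

import panflute
import pypandoc


def action(elem, doc):
    # Called once for every element in the document tree.
    if isinstance(elem, panflute.Image):
        doc.images.append(elem)
    elif isinstance(elem, panflute.Link):
        doc.links.append(elem)


if __name__ == '__main__':
    # Ask pandoc for the document's AST as a JSON string.
    data = pypandoc.convert_file('example.md', 'json')
    doc = panflute.load(io.StringIO(data))
    # Attach the result lists to the Doc so that action() can reach them.
    doc.images = []
    doc.links = []
    doc = panflute.run_filter(action, doc=doc)

    print("\nList of image URLs and alt text:")
    for image in doc.images:
        # stringify() flattens the element's content to plain text.
        print(image.url, panflute.stringify(image))

    print("\nList of link URLs and link text:")
    for link in doc.links:
        print(link.url, panflute.stringify(link))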

The steps are:

  1. Use pypandoc to obtain a JSON string that contains the AST of the Markdown document.
  2. Load it into panflute to create a Doc object (panflute requires a stream, so we use StringIO).
  3. Use the run_filter function to iterate over every element and collect the Image and Link objects.
  4. Then you can print the URLs, alt text, etc.
Sergio Correia

You can convert the Markdown into HTML with Python-Markdown, and then extract what you want from the HTML document using Beautiful Soup, which makes extracting images and links very straightforward.

This might seem like a complicated pipeline, but it is certainly easier and more robust than, for instance, writing an ad hoc Markdown parser using regular expressions. These modules are battle-tested and efficient.
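
A rough sketch of that pipeline is below. Two assumptions to flag: the class and function names are made up for illustration, and the HTML step uses the standard library's html.parser in place of Beautiful Soup so that Python-Markdown is the only third-party dependency. Beautiful Soup's find_all('a') and find_all('img') would do the same job in fewer lines.

```python
from html.parser import HTMLParser


class LinkImageCollector(HTMLParser):
    """Collect <a> and <img> elements from rendered HTML."""

    def __init__(self):
        super().__init__()
        self.links = []         # dicts with 'href' and 'text'
        self.images = []        # dicts with 'src' and 'alt'
        self._open_link = None  # the <a> currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self._open_link = {"href": attrs.get("href"), "text": ""}
        elif tag == "img":
            self.images.append({"src": attrs.get("src"), "alt": attrs.get("alt", "")})

    def handle_data(self, data):
        # Text inside an open <a> (including nested <em>, <code>, ...) is link text.
        if self._open_link is not None:
            self._open_link["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._open_link is not None:
            self.links.append(self._open_link)
            self._open_link = None


def extract(html):
    """Return (links, images) found in an HTML string."""
    collector = LinkImageCollector()
    collector.feed(html)
    collector.close()
    return collector.links, collector.images


def extract_from_markdown(text):
    """Render Markdown with Python-Markdown, then extract links and images."""
    import markdown  # third-party: pip install markdown
    return extract(markdown.markdown(text))
```

Calling extract_from_markdown(open('example.md').read()) should then give two lists of dictionaries, one for links and one for images.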

Håken Lid
  • Python-Markdown uses ElementTree internally and has an extensive Extension API. You may be able to interrupt the parser and loop over the ElementTree to extract your elements and skip a few steps. But that would be bending things in ways they were not really intended, so parsing the HTML output is probably going to give more reliable results. – Waylan Dec 04 '16 at 00:35
  • Python-Markdown also has an extensive number of extensions (both [included](https://pythonhosted.org/Markdown/extensions/index.html#officially-supported-extensions) and [third-party](https://github.com/waylan/Python-Markdown/wiki/Third-Party-Extensions)) available so you should be able to get most, if not all, of MultiMarkdown's features. And if there's a missing feature you really care about, you can [write your own extension](https://github.com/waylan/Python-Markdown/wiki/Tutorial:-Writing-Extensions-for-Python-Markdown). – Waylan Dec 04 '16 at 00:39