Python: get json frontmatter from markdown file

Question

I would like to parse a content file from a static site generator using Python 3. Such files can have frontmatter in JSON, YAML or TOML at the beginning of the file, followed by the content. It's easy to get the frontmatter if it is YAML or TOML, because those start and end with a specific string (--- or +++). Is there a way to get the JSON object from the beginning of the file into a Python object and the content that is the rest of the file into a string?

Here is an example of a file, based on the frontmatter example from the Hugo static site generator:

{
   "categories": [
      "Development",
      "VIM"
   ],
   "date": "2012-04-06",
   "description": "spf13-vim is a cross platform distribution of vim plugins and resources for Vim.",
   "slug": "spf13-vim-3-0-release-and-new-website",
   "tags": [
      ".vimrc",
      "plugins",
      "spf13-vim",
      "vim"
   ],
   "title": "spf13-vim 3.0 release and new website"
}
# Et sed pronos letum minatur

## Hos promissa est induit ductae non tamen

Lorem markdownum est, peragentem nomine fugaeque terruit ista quantum constat
vicinia. Per lingua concita. *Receptus Sibylla* frustra, genitor praesensque
texta vitiatis traxere cum natura feram ducunt terram.

Based on the answer to "Python Regex to match YAML Front Matter" I got this:

matches = re.search(r'^\s*(\{.*\})\s*$(.*)', content, re.DOTALL|re.MULTILINE)

That basically works, but there could be another closing curly bracket at the beginning of a line in the text below the JSON part, and the regex doesn't cope with nested JSON objects.

Can you give an example of the input file? show us the code you have tried, and let us know where it goes wrong. — Edwin van Mierlo, Apr 26 '18 at 11:34
added input file example, code and more detailed problem description — , Apr 26 '18 at 12:51
refined the regex and clarified the problem with that approach — , Apr 26 '18 at 14:10
Are you going to have any closing curly braces outside of the json front matter? If not, you could simply do a `str.rfind` to locate the last occurrence and then slice the text to get json. Otherwise, you could loop through the locations of the closing brace and try to load the json; if you get a ValueError, move on to the next occurrence. Brute force comparison test rather than regex. — Alan, Apr 26 '18 at 15:34
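The brute-force idea from the last comment could look roughly like the sketch below. It assumes the file starts with the JSON object (leading whitespace aside); the function name `split_json_frontmatter` is made up for illustration. It tries each closing brace in turn and accepts the first slice that `json.loads` parses, so nested objects and braces in the body text do not confuse it.

```python
import json


def split_json_frontmatter(text):
    """Return (metadata, body) by testing each '}' as the end of the JSON block."""
    stripped = text.lstrip()
    if not stripped.startswith("{"):
        raise ValueError("file does not start with a JSON object")

    end = stripped.find("}")
    while end != -1:
        candidate = stripped[:end + 1]
        try:
            # json.JSONDecodeError is a subclass of ValueError
            metadata = json.loads(candidate)
        except ValueError:
            # Not a complete object yet: move on to the next closing brace
            end = stripped.find("}", end + 1)
            continue
        # Everything after the matching brace is the content
        return metadata, stripped[end + 1:].lstrip("\n")

    raise ValueError("no complete JSON object found at the start of the file")


sample = '{"title": "Post", "meta": {"draft": false}}\n# Heading\n\nA stray } here.\n'
metadata, body = split_json_frontmatter(sample)
```

Each failed attempt re-parses the text from the start, which is wasteful for large frontmatter blocks but usually negligible for typical content files.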

score 3 · Answer 1 · answered Sep 06 '18 at 03:45

I too was searching for a tool to do what you described, but I was interested in less primitive frontmatter types, such as TOML and YAML. The following project should provide what you need. I also provide a few snippets from the project's docs to demonstrate the behavior.

Python Frontmatter (docs)

Parse and manage posts with YAML (or other) frontmatter

>>> import frontmatter
>>> post = frontmatter.load('tests/hello-world.markdown')
>>> print(post.content)
Well, hello there, world.
>>> print(post['title'])
Hello, world!
>>> print(frontmatter.dumps(post))
---
excerpt: tl;dr
layout: post
title: Hello, world!
---
Well, hello there, world.
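If adding a dependency is not an option, the standard library may be enough for the JSON case. The following is only a sketch, not something taken from the project above: it relies on `json.JSONDecoder.raw_decode`, which parses one JSON value from the start of a string and returns the index where it stopped. The function name `parse_json_frontmatter` is hypothetical, and the code assumes the JSON object is the very first thing in the file.

```python
import json


def parse_json_frontmatter(text):
    """Return (metadata, body) for text that begins with a JSON object."""
    # raw_decode does not skip leading whitespace, so strip it first
    stripped = text.lstrip()
    # Parses exactly one JSON value and reports where it ended
    metadata, end = json.JSONDecoder().raw_decode(stripped)
    if not isinstance(metadata, dict):
        raise ValueError("frontmatter is not a JSON object")
    # The remainder of the file is the content
    return metadata, stripped[end:].lstrip("\n")


sample = (
    '{\n'
    '   "categories": ["Development", "VIM"],\n'
    '   "extra": {"nested": true}\n'
    '}\n'
    '# Et sed pronos letum minatur\n'
    '\n'
    'Text with a closing brace:\n'
    '}\n'
)
metadata, body = parse_json_frontmatter(sample)
```

Because the decoder stops at the end of the first complete value, nested objects and later closing braces in the Markdown body are handled without any regex.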
