
I would like to parse a content file from a static site generator using Python 3. Such files can have frontmatter in JSON, YAML or TOML at the beginning of the file, followed by the content. It's easy to get the frontmatter if it is YAML or TOML, because those start and end with a specific string (--- or +++). Is there a way to parse the JSON object at the beginning of the file into a Python object, and to get the content that makes up the rest of the file as a string?

Here is an example of such a file, based on the frontmatter example of the Hugo static site generator:

{
   "categories": [
      "Development",
      "VIM"
   ],
   "date": "2012-04-06",
   "description": "spf13-vim is a cross platform distribution of vim plugins and resources for Vim.",
   "slug": "spf13-vim-3-0-release-and-new-website",
   "tags": [
      ".vimrc",
      "plugins",
      "spf13-vim",
      "vim"
   ],
   "title": "spf13-vim 3.0 release and new website"
}
# Et sed pronos letum minatur

## Hos promissa est induit ductae non tamen

Lorem markdownum est, peragentem nomine fugaeque terruit ista quantum constat
vicinia. Per lingua concita. *Receptus Sibylla* frustra, genitor praesensque
texta vitiatis traxere cum natura feram ducunt terram.

Based on the answer to "Python Regex to match YAML Front Matter", I got this:

matches = re.search(r'^\s*(\{.*\})\s*$(.*)', content, re.DOTALL|re.MULTILINE)

This basically works, but there could be another closing curly bracket at the beginning of a line in the text part below the JSON, and the regex doesn't cope with nested JSON objects.

Anthon
  • Can you give an example of the input file? show us the code you have tried, and let us know where it goes wrong. – Edwin van Mierlo Apr 26 '18 at 11:34
  • added input file example, code and more detailed problem description –  Apr 26 '18 at 12:51
  • refined the regex and clarified the problem with that approach –  Apr 26 '18 at 14:10
  • Are you going to have any closing curly braces outside of the json front matter? If not, you could simply do a `str.rfind` to locate the last occurrence and then slice the text to get json. Otherwise, you could loop through the locations of the closing brace and try to load the json; if you get a ValueError, move on to the next occurrence. Brute force comparison test rather than regex. – Alan Apr 26 '18 at 15:34
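
The brute-force idea from the last comment can be sketched roughly as follows. This is only an illustration: the function name `split_json_frontmatter` is made up here, and the sketch assumes the file begins with the JSON object (optionally preceded by whitespace).

```python
import json


def split_json_frontmatter(text):
    """Try each closing brace in turn until the prefix parses as JSON.

    Hypothetical helper: returns (metadata, body), or (None, text) when
    no JSON frontmatter is found at the start of the text.
    """
    stripped = text.lstrip()
    if not stripped.startswith("{"):
        return None, text
    end = stripped.find("}")
    while end != -1:
        try:
            meta = json.loads(stripped[:end + 1])
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            end = stripped.find("}", end + 1)
            continue
        # Drop the line break that separates the frontmatter from the body.
        return meta, stripped[end + 1:].lstrip("\r\n")
    return None, text
```

Because every candidate prefix is validated by `json.loads`, nested objects and braces inside strings are handled, at the cost of reparsing the prefix once per closing brace.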

1 Answer


I too was searching for a tool to do what you described, but I was interested in less primitive frontmatter types, such as TOML and YAML. The following project should provide what you need. I also provide a few snippets from the project's docs to demonstrate the behavior.

Python Frontmatter (docs)

Parse and manage posts with YAML (or other) frontmatter

>>> import frontmatter
>>> post = frontmatter.load('tests/hello-world.markdown')
>>> print(post.content)
Well, hello there, world.
>>> print(post['title'])
Hello, world!
>>> print(frontmatter.dumps(post))
---
excerpt: tl;dr
layout: post
title: Hello, world!
---
Well, hello there, world.
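
For the JSON case specifically, a standard-library-only alternative might look like the sketch below. It relies on `json.JSONDecoder.raw_decode`, which parses one JSON document from the start of a string and returns the parsed value together with the index where it stopped. The function name `parse_json_frontmatter` is hypothetical, and the sketch assumes the frontmatter, if present, is the first non-whitespace thing in the file.

```python
import json


def parse_json_frontmatter(text):
    """Split text into (metadata, body) using the stdlib JSON decoder.

    Hypothetical helper: returns ({}, text) when the text does not start
    with a JSON object. Invalid JSON raises json.JSONDecodeError.
    """
    # raw_decode does not skip leading whitespace, so strip it first.
    stripped = text.lstrip()
    if not stripped.startswith("{"):
        return {}, text
    # raw_decode stops at the end of the first JSON value and reports
    # that position, so trailing non-JSON content is not an error.
    meta, end = json.JSONDecoder().raw_decode(stripped)
    # Drop the line break that separates the frontmatter from the body.
    return meta, stripped[end:].lstrip("\r\n")
```

Since the decoder itself finds the end of the object, nested objects and closing braces in the body text do not need any special handling.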
ngenetzky