I'm having trouble crafting a regex to match YAML Front Matter

This is the front matter I was trying to match:

    ---
    name: me
    title: test
    cpu: 1
    ---

This is what I thought would work:

    re.search(r'^(---)(.*)(---)$', content, re.MULTILINE)

Any help would be greatly appreciated.

Mitciv
  • Are you getting whitespace before or after the dashes (---)? That'd break it. – MrGomez Mar 18 '12 at 06:03
  • Yeah, it's possible to get whitespace before/after. If you don't mind, how would I adjust my regex to handle that? – Mitciv Mar 18 '12 at 06:18
  • Given my attempts to just answer the question, I'll roll this into an answer. :) – MrGomez Mar 18 '12 at 06:24
  • You should also add [`re.DOTALL`](http://docs.python.org/library/re.html#re.DOTALL) to your expression, otherwise `re.MULTILINE` won't do what you expect. – Burhan Khalid Mar 18 '12 at 06:25
  • I did end up figuring it out, but please submit your answer so I can accept it. – Mitciv Mar 18 '12 at 06:26
  • Sorry for the delay; detailed answer. – MrGomez Mar 18 '12 at 06:41
  • Hi, I'm trying to do the same, but with `re.sub` instead of `re.search`. With the latter everything works well, but when I use the same regex with `re.sub` the text is not replaced. Any idea why? – Alejandro Alcalde Apr 18 '17 at 19:11

2 Answers

To unpack what your current regular expression does:

    r'^(---)(.*)(---)$'

  • r: Treat this as a raw string literal in Python, so backslashes reach the regex engine unchanged
  • ^: Match at the start of a line (with re.MULTILINE; otherwise only at the start of the string)
  • (---): Match --- and capture it in an unnamed group
  • (.*): Match any character except a newline (.) zero or more times (*), greedily, and capture the result
  • (---): As above
  • $: Match at the end of a line (with re.MULTILINE)

The trouble is that this fails when whitespace is present. You're literally saying: find dashes at the very beginning of a line and match until dashes that sit at the very end of one. Furthermore, the parentheses () around the dashes create capture groups that I believe aren't needed to make use of the match.
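To see the whitespace problem in isolation, here is a minimal sketch. The two sample strings are assumptions modeled on the question's front matter, and re.DOTALL is switched on so that the dot can span lines and only the padding differs between the two cases.

```python
import re

pattern = r'^(---)(.*)(---)$'
flags = re.MULTILINE | re.DOTALL  # DOTALL so that `.` can span lines

# Hypothetical inputs: the same front matter without and with padding
# around the dashes.
clean = "---\nname: me\ntitle: test\ncpu: 1\n---\n"
padded = "  ---\nname: me\ntitle: test\ncpu: 1\n---  \n"

clean_match = re.search(pattern, clean, flags)
padded_match = re.search(pattern, padded, flags)

print(clean_match is not None)   # the unpadded block is found
print(padded_match is not None)  # padding around the dashes prevents a match
```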

A better expression would be:

    r'^\s*---(.*)---\s*$'

This adds \s* to match any whitespace between the start of the line and the opening dashes, adds it again between the closing dashes and the end of that line, and captures everything in between in a single unnamed group that you can use for further processing. If you don't need the contents of the front matter, simply replace (.*) with .*.

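Consider re.findall if the expression should be evaluated several times against a single file and, as mentioned, use re.DOTALL so that the dot character can match newlines. Keep re.MULTILINE as well, so that ^ and $ match at line boundaries. If the text contains further --- lines after the front matter, the lazy form (.*?) stops at the first closing delimiter instead of the last.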
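Putting those pieces together, a minimal sketch might look like the following. The sample `content` and the `FRONT_MATTER` name are assumptions for illustration, and the lazy `(.*?)` is a small variation on the expression above rather than the exact pattern given there.

```python
import re

# Assumed sample document: padded delimiters, front matter, then a body
# that itself contains a `---` horizontal rule.
content = (
    "  ---\nname: me\ntitle: test\ncpu: 1\n---  \n"
    "Body text\n---\nMore text\n"
)

# Lazy `(.*?)` stops at the first closing `---`; MULTILINE anchors ^ and $
# to line boundaries, and DOTALL lets `.` span the lines in between.
FRONT_MATTER = re.compile(r'^\s*---(.*?)---\s*$', re.MULTILINE | re.DOTALL)

match = FRONT_MATTER.search(content)
front_matter = match.group(1).strip() if match else None
print(front_matter)
```

With the greedy `(.*)` the same flags would carry the match through to the horizontal rule in the body, which is why the lazy quantifier is used in this sketch.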

MrGomez

I've used something like this regex, re.findall('^---[\s\S]+?---', text):

    import re
    import yaml

    def extractFrontMatter(markdown):
        # Read the whole Markdown file; `with` closes it even on error
        with open(markdown, 'r') as md:
            text = md.read()
        # Returns first yaml content, `--- yaml frontmatter ---` from the .md file
        # http://regexr.com/3f5la
        # https://stackoverflow.com/questions/2503413/regular-expression-to-stop-at-first-match
        match = re.findall(r'^---[\s\S]+?---', text)
        if match:
            # Strips `---` to create a valid yaml object
            ymd = match[0].replace('---', '')
            try:
                # safe_load avoids constructing arbitrary Python objects
                return yaml.safe_load(ymd)
            except yaml.YAMLError as exc:
                print(exc)

I've also come across python-frontmatter, which has some additional helper functions:

    import frontmatter
    post = frontmatter.load('/path/to-markdown.md')

    print(post.metadata, 'meta')
    print(post.keys(), 'keys')

Vinnie James