
So I'm trying to do a simple regex replacement of a date in the metadata of a document using sed from bash. For example, suppose I have input file test.md containing:

---
title: "I am a file"
date: December 1, 2021
---

Loren ipsum blah blah blah

I'd like to be able to run a bash script on December 29 and get an output file

---
title: "I am a file"
date: December 29, 2021
---

Loren ipsum blah blah blah

So here's my first try:

#!/bin/bash

TODAY=$(date +'%B %d, %Y')
STARTBIT="date: "

FULLDATE="$STARTBIT$TODAY"

REGEX="s/date:\s.*\n/$FULLDATE/"

echo $REGEX # to make sure I'm getting what I think I'm getting

sed  -e $REGEX < test.md > output.md

but I get the following output:

s/date:\s.*\n/date: December 29, 2021/
sed: 1: "s/date:\s.*\n/date:
": unescaped newline inside substitute pattern

This is a bit confusing: the first line is my echoed pattern, and I definitely don't see any newlines in it on the command line. Nor am I sure where the newlines would supposedly be.

So then I thought, ok, maybe the newline is appended to the end of one of the variables, and for some reason is made invisible due to some bash silliness when I echo it. So based on this prior SO answer, I just went in and stripped newlines from the end of everything just to make sure. Viz:

#!/bin/bash

TODAY=$(date +'%B %d, %Y')
STARTBIT="date: "
CLEANSTARTBIT=${STARTBIT%%[[:space:]]}
CLEANTODAY=${TODAY%%[[:space:]]}

FULLDATE="$STARTBIT$TODAY"
CLEANFULLDATE=${FULLDATE%%[[:space:]]}

REGEX="s/date:\s.*\n/$CLEANFULLDATE/"
CLEANREGEX=${REGEX%%[[:space:]]}

echo $CLEANREGEX

sed  -e $CLEANREGEX < test.md > output.md

and I'm still getting exactly the same output. But now I'm really stumped. There can't possibly be newlines sneaking in here...

Help??

Bonus possible issues:

  1. I'm using the version of sed that shipped with macOS. Heaven only knows what version. Maybe I should try getting my hands on GNU sed??

  2. I don't really know what flavor of regex sed uses, or indeed how sed works at all... I basically just copied the regex over from the one I was using in a python script since forever, for learning purposes/because I'm sick of calling out to python for this bit of basic text processing that I do all the time. Hah, but I actually know python regex...

Paul Gowder
  • http://shellcheck.net/ is your friend. As it will point out: _Always_ quote your expansions. `"$REGEX"`, not `$REGEX`. (Also, while it _won't_ point this out, you shouldn't use all-caps names for your own variables; POSIX reserves that namespace for variables that change or reflect operation of the shell and others standard-defined tools). – Charles Duffy Dec 29 '21 at 21:11
  • ...that naming convention is given at https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html -- read it keeping in mind that setting a regular shell variable overwrites any preexisting environment variable by the same name, so you can't be 100% sure you're complying with the standard for environment variables without _also_ complying with it for shell variables. – Charles Duffy Dec 29 '21 at 21:13
  • Also, `\s` isn't guaranteed to work in `sed` _at all_. It's a PCRE extension; `sed` only guarantees support for the BRE regex syntax, or in some cases (with `-E` on BSD sed or `-r` on GNU sed) ERE. Some common operating systems _do_ add `\s` support to their versions of `sed`, but it's not reliable/portable behavior; use `[[:space:]]` instead to work with all standard-compliant implementations. – Charles Duffy Dec 29 '21 at 21:14
  • See https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions for an introduction on POSIX BRE, or https://en.wikibooks.org/wiki/Regular_Expressions/POSIX-Extended_Regular_Expressions re: ERE. – Charles Duffy Dec 29 '21 at 21:19

1 Answer


First problem: you need to double-quote your variable references (e.g. echo "$REGEX" instead of echo $REGEX). Without the double-quotes, the variable's value will be split into "words", and any words that look like filename wildcards will be expanded into a list of matching files. You almost never want either of these things to happen, so you should almost always double-quote variable references. In particular, this command:

sed  -e $REGEX < test.md > output.md

Expands to something like:

sed -e s/date:\s.*\n/date: December 29, 2021/

...and "s/date:\s.*\n/date:", "December", "29,", and "2021/" are all treated as completely separate arguments to sed. The error message is misleading; the real error is that the first argument, which sed takes as its script, is an incomplete s command: it has no closing /.
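To see the splitting for yourself, here's a minimal sketch. The count_args helper is a made-up name for illustration (it isn't part of the original script); it just reports how many arguments it received. It assumes no file in the current directory happens to match the pattern as a wildcard.

```shell
#!/bin/bash

# Hypothetical helper: print how many arguments were passed in.
count_args() { echo "$#"; }

regex='s/date:\s.*\n/date: December 29, 2021/'

# Unquoted: the shell splits the value on whitespace before calling the function.
unquoted=$(count_args $regex)

# Quoted: the value is passed through as a single argument.
quoted=$(count_args "$regex")

echo "unquoted: $unquoted argument(s)"
echo "quoted: $quoted argument(s)"
```

With the value above, the unquoted call should report 4 arguments and the quoted call 1, which is the difference between sed seeing a broken script plus three stray file names and sed seeing one complete command.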

(If you happened to have any files matching s/date:\s.*\n/date -- unlikely, but technically possible -- things would get even sillier.)

The second problem is that, as you guessed, your regex is in the wrong syntax dialect. The version of sed that comes with macOS doesn't support the \s shorthand, so use [[:space:]] instead. Also, using \n to match the end of a line doesn't work in any flavor of sed: sed strips the trailing newline from each line before applying the pattern, so there's no newline there to match. Use $ instead (but you need to escape it, since it's in double-quotes and you don't want it to initiate some expansion rule):

REGEX="s/date:[[:space:]].*\$/$FULLDATE/"

Technically, you don't need the $ either. Regex matching is greedy, so if it can match to the end of the line -- and it can -- it will match to the end of the line.
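A quick way to check this, as a sketch using a throwaway sample line and a placeholder replacement (NEW), is to run the substitution with and without the trailing $ and compare the results:

```shell
#!/bin/bash

line='date: December 1, 2021'

# Same substitution twice: once anchored with $, once without.
# Single quotes are used here, so the $ needs no backslash.
with_dollar=$(printf '%s\n' "$line" | sed -e 's/date:[[:space:]].*$/date: NEW/')
without_dollar=$(printf '%s\n' "$line" | sed -e 's/date:[[:space:]].*/date: NEW/')

echo "$with_dollar"
echo "$without_dollar"
```

Both commands should print the same line, since .* already consumes everything up to the end of the line.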

But it'd be a good idea to add ^ at the beginning of the pattern, to anchor it to the beginning of a line. Otherwise, it'll match "date: " anywhere in a line.
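Putting the pieces together, here's a sketch of how the corrected script might look. The update_date function name and the lower-case variable names are my own choices for illustration; the sed expression is the one above with ^ added. It assumes the replacement text contains no / or & characters (true for this date format), since those are special in the replacement part of an s command.

```shell
#!/bin/bash

# Hypothetical helper: read a document on stdin and write it to stdout with
# any line starting with "date: " replaced by the date given as the first argument.
update_date() {
  local new_date=$1
  # ^ anchors the match to the start of the line; \$ keeps the shell from
  # treating the dollar sign as the start of an expansion.
  sed -e "s/^date:[[:space:]].*\$/date: $new_date/"
}

today=$(date +'%B %d, %Y')

# Only run the conversion when the input file from the question exists.
if [ -f test.md ]; then
  update_date "$today" < test.md > output.md
fi
```

Note that this rewrites every line that starts with "date: ", not only the one in the metadata block, so it relies on the body of the document not containing such a line.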

Third, I'd recommend switching to lower- or mixed-case variable names. There are a bunch of all-caps names with special meanings, and if you accidentally use one of those it can have weird effects.
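As one illustration of those weird effects (a sketch of my own, not something from the original script), PATH is the classic all-caps name to trip over. The directory name below is a placeholder, and the subshell keeps the damage contained:

```shell
#!/bin/bash

# Run in a subshell so the change to PATH doesn't leak into the rest of the script.
lookup=$(
  # Meant as an ordinary variable, but it replaces the command search path.
  PATH="/some/directory/of/mine"
  if command -v sed >/dev/null 2>&1; then
    echo "sed found"
  else
    echo "sed not found"
  fi
)

echo "$lookup"
```

Once PATH points somewhere that doesn't contain sed, the shell can no longer find it, which is a confusing failure if you only meant to store a directory name. A lower-case name such as path avoids the collision.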

Final note: use shellcheck.net -- it'll point out a lot of common scripting mistakes (such as failing to double-quote).

Gordon Davisson