Split Markdown text file by regular expression that defines headings

Question

I am trying to use a command-line program to split a large text file into chunks, where:

- the split happens on a defined regex pattern
- the filenames are defined by a capturing group in that regex pattern

The text file is of the format:

# Title

# 2020-01-01

Multi-line content
goes here

# 2020-01-02

Other multi-line content
goes here

Output should be these two files with the following filenames and contents:

2020-01-01.md ↓

# 2020-01-01

Multi-line content
goes here

2020-01-02.md ↓

# 2020-01-02

Other multi-line content
goes here

I can't seem to get all the criteria right.

The regex pattern to split on (separator) is simple enough, something along the lines of ^# (2020-.*)$

Either I can't set up a multi-line regex pattern that spans \n newlines and stops at the next occurrence of the separator pattern, or I can split with csplit on the regex pattern but can't name the files after what is captured in (2020-.*)

The same goes for awk with split() or match(): I can't get it to work entirely.

I'm looking for a general solution, with the parameters being the regex patterns that define the chunk beginnings (e.g. # 2020-01-01) and endings (e.g. the next date heading # 2020-01-02, or EOF).

score 1 · Answer 1 · answered Sep 02 '21 at 21:18

Here is a Perl one-liner that does this with a regex:

perl -0777 -nE 'while (/^\h*#\h*(2020.*)([\s\S]*?(?:(?=(^\h*#\h*2020.*))|\z))/gm) {
    # $1 is the date from the heading, $2 is everything up to the next date heading or EOF
    open($fh, ">", $1.".md") or die $!;
    print $fh $1;
    print $fh $2;
    close $fh;
}' file

Result (the heading lines lack the leading "# " because only the captured date, $1, is printed):

head 2020*
==> 2020-01-01.md <==
2020-01-01

Multi-line content
goes here


==> 2020-01-02.md <==
2020-01-02

Other multi-line content
goes here

score 1 · Accepted Answer · answered Sep 02 '21 at 21:20

Using any awk in any shell on every Unix box:

$ awk '/^# [0-9]/{ close(out); out=$2".md" } out!=""{print > out}' file

$ head *.md
==> 2020-01-01.md <==
# 2020-01-01

Multi-line content
goes here


==> 2020-01-02.md <==
# 2020-01-02

Other multi-line content
goes here

If /^# [0-9]/ isn't an adequate regexp then change it to whatever you like, e.g. /^# [0-9]{4}(-[0-9]{2}){2}$/ would be more restrictive. FWIW though I wouldn't have used a regexp at all for this if you hadn't asked for one. I'd have used:

awk '($1=="#") && (c++){ close(out); out=$2".md" } out!=""{print > out}' file
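
If the input is messier, the stricter regexp mentioned above can be passed in as a parameter instead of being hard-coded. The following is only a sketch under a few assumptions: the variable name re, the scratch directory split-demo and the sample file notes.md are made up for illustration, and the date pattern is spelled out with repeated [0-9] classes rather than {4} intervals, since some older awk builds do not support intervals.

```shell
#!/bin/sh
# Scratch directory and sample input (both hypothetical, for illustration only).
mkdir -p split-demo
cat > split-demo/notes.md <<'EOF'
# Title

# 2020-01-01

Multi-line content
goes here

# Not a date

# 2020-01-02

Other multi-line content
goes here
EOF

# Heading pattern that starts a new chunk: "# YYYY-MM-DD" and nothing else on the line.
re='^# [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$'

(
  cd split-demo &&
  awk -v re="$re" '
      $0 ~ re   { close(out); out = $2 ".md" }  # a date heading opens a new output file
      out != "" { print > out }                 # copy every line once a chunk is open
  ' notes.md
)
```

With this sample, "# Title" is skipped and "# Not a date" stays inside 2020-01-01.md rather than starting a file of its own. A date-like "# ..." line inside a fenced code block would still be treated as a heading, so the pattern alone does not cover that case.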

Ed Morton


The regexp was an attempt to ensure the command can handle a messy input. There may be other level 1 headings there that aren't dates, typos, fenced code blocks with # as comments. The file has 331 headings like that but the command outputs 327 files – Leeroy Sep 02 '21 at 22:13
Figured it out with a diff: the mismatch was due to duplicate date stamps, so files would get overwritten. I suppose the solution could be made to handle even this mess, but I'll just edit the source document to correct it. – Leeroy Sep 03 '21 at 07:45
It'd be easy to append to the file instead of overwrite it when a duplicate date is found, or append a ".1" or something to the duplicate file name(s). Let me know if you want any of that and I'll update the answer. – Ed Morton Sep 03 '21 at 14:01
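
For reference, the suffix idea from the last comment could look roughly like the sketch below. It is not part of the accepted answer: the seen array, the dup-demo directory and the sample input are assumptions made for illustration, and placing the counter before the extension (2020-01-01.1.md) is just one possible naming choice. The first command lists the date headings that occur more than once, which is a quicker check than a diff.

```shell
#!/bin/sh
# Scratch directory and sample input with a repeated date (hypothetical).
mkdir -p dup-demo
cat > dup-demo/notes.md <<'EOF'
# Title

# 2020-01-01

First entry

# 2020-01-01

Second entry with the same date
EOF

# Diagnostic: print each date heading that appears more than once.
grep '^# [0-9]' dup-demo/notes.md | sort | uniq -d

(
  cd dup-demo &&
  awk '
      /^# [0-9]/ {
          close(out)
          n = seen[$2]++                    # how often this date was seen before
          out = (n ? $2 "." n : $2) ".md"   # 2020-01-01.md, then 2020-01-01.1.md, ...
      }
      out != "" { print > out }
  ' notes.md
)
```

To append duplicates to the same file instead, keeping out = $2 ".md" and changing > to >> should work, with the caveat that re-running the command then appends to files left over from an earlier run.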
