How to merge multiple markdown files with pandoc while retaining cross document links?

Question

I am trying to merge multiple markdown documents from a single folder into one PDF with pandoc. The documents may contain links to each other, which should remain browsable in the markdown format, e.g. in IntelliJ or within GitLab.

Simple example documents:

0001-file-a.md

---
id: 0001

---
# File a

This is a simple file with an [external link](www.stackoverflow.com).

0002-file-b.md

---
id: 0002

---
# File b

This file links to [another file](0001-file-a.md).

Pandoc does not handle this case out of the box, e.g. when running the following command:

pandoc -s -f markdown -t pdf *.md -V linkcolor=blue -o test.pdf

It merges the files, creates a PDF and highlights the links correctly, but clicking the second link tries to open the file 0001-file-a.md instead of jumping to the corresponding location in the document.

Many people have run into this problem before me, but none of the solutions I have found so far solves it. The closest I came was with the help of this answer: https://stackoverflow.com/a/61908457/6628753

It defines a filter that is first applied to each file; the resulting JSON files are then merged. I modified this filter to fit my needs:

Add the number of the file to the label of the top-level header
Prepend the top-level header to all other header labels
Remove .md from internal links
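
To make the three modifications concrete, here is a minimal sketch of them as plain string functions, independent of pandoc. The function names and the sub-header label "details" are hypothetical and only serve as illustration; the label "file-a" assumes pandoc's automatic identifier for the header "File a".

```python
import re


def top_level_label(file_id: str, label: str) -> str:
    """Label for a level-1 header: '<file id>-<original label>'."""
    return f"{file_id}-{label}"


def nested_label(top_label: str, label: str) -> str:
    """Label for a deeper header: '<top-level label>-<original label>'."""
    return f"{top_label}-{label}"


def strip_md_suffix(linkref: str) -> str:
    """Drop '.md' at the end of the path (before an optional '#fragment')."""
    return re.sub(r"\.md(?=#|$)", "", linkref)


top = top_level_label("0001", "file-a")
print(top)                                # 0001-file-a
print(nested_label(top, "details"))       # 0001-file-a-details
print(strip_md_suffix("0001-file-a.md"))  # 0001-file-a
```

With the example files, the rewritten link target then equals the new label of the top-level header in 0001-file-a.md.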

Here is the filter:

#!/usr/bin/env python3
"""
Pandoc filter to convert internal links for multifile documents
"""

import re
import sys

from pandocfilters import toJSONFilter, Header, Link

headerL1 = []
def fix_links(key, value, format, meta):
    global headerL1

    # Store level 1 headers
    if key == "Header":
        [level, [label, t1, t2], header] = value
        if level == 1:
            # "id" comes from the YAML front matter of the current file
            file_id = meta.get("id")
            newlabel = f"{file_id['c'][0]['c']}-{label}"
            headerL1 = [newlabel]
            sys.stderr.write(f"\nGlobal header: {headerL1}\n")
            return Header(level, [newlabel, t1, t2], header)

        # Prepend level 1 header label to all other header labels
        if level > 1:
            prefix = headerL1[0]
            newlabel = prefix + "-" + label
            sys.stderr.write(f"Header label: {label} -> {newlabel}\n")
            return Header(level, [newlabel, t1, t2], header)

    if key == "Link":
        [t1, linktext, [linkref, t4]] = value

        if ".md" in linkref:
            # Escape the dot so that only a literal ".md" is removed
            newlinkref = re.sub(r'\.md', r'', linkref)
            sys.stderr.write(f'Link: {linkref} -> {newlinkref}\n')
            return Link(t1, linktext, [newlinkref, t4])
        else:
            sys.stderr.write(f'External link: {linkref}\n')


if __name__ == "__main__":
    toJSONFilter(fix_links)

And here is a script that executes the whole thing:

#!/bin/bash

MD_INPUT=$(find . -type f -name '*.md' | sort)

# Pass the markdown through the gitlab filters into Pandoc JSON files
echo "Filtering Gitlab markdown"
for file in $MD_INPUT
do
  echo "Filtering $file"
  pandoc \
  --filter fix-links.py \
  "$file" \
  -t json \
  -o "${file%.md}.json"
done

JSON_INPUT=$(find . -type f -name '*.json' | sort)

echo "Generating LaTeX"
pandoc -s -f json -t latex $JSON_INPUT -V linkcolor=blue -o test.tex

echo "Generating PDF"
pandoc -s -f json -t pdf $JSON_INPUT -V linkcolor=blue -o test.pdf
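
For comparison, here is a rough Python sketch of the same two-step pipeline. It only builds the pandoc command lines (using the same flags as the script above) and does not run them; the function name build_commands is hypothetical. Unlike the shell loop, it would also cope with file names that contain spaces, and it passes only the JSON files it generated itself to the final step.

```python
from pathlib import Path


def build_commands(root: Path, output: str = "test.pdf") -> list[list[str]]:
    """Return one filter command per markdown file plus the final merge command."""
    commands = []
    json_files = []
    # Sorted so the files are merged in the order of their numeric prefix
    for md in sorted(root.rglob("*.md")):
        json_file = md.with_suffix(".json")
        json_files.append(str(json_file))
        commands.append(
            ["pandoc", "--filter", "fix-links.py", str(md),
             "-t", "json", "-o", str(json_file)]
        )
    commands.append(
        ["pandoc", "-s", "-f", "json", "-t", "pdf", *json_files,
         "-V", "linkcolor=blue", "-o", output]
    )
    return commands
```

Each returned command could then be executed in order with `subprocess.run(cmd, check=True)`.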

Running this script generates a PDF in which the second link does not work at all. Looking at the LaTeX code, the problem can be solved by replacing the generated \href command with \hyperlink. Once this is done, the link works as expected.

The problem is that pandoc does not do this automatically, which almost seems like a bug. Is there a way to tell pandoc from within the filter that a link is internal?

After running the filter, fixing the issue is non-trivial, since there is no good way to differentiate internal from external links.
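
One possible direction, sketched below and untested against the full pipeline: classify each link target inside the filter, before the .md suffix is stripped, and rewrite internal targets so that they start with "#". This rests on two assumptions: that pandoc's LaTeX writer treats targets beginning with "#" as in-document references, and that the header labels follow the "<id>-<label>" scheme produced by the filter, with the file name prefix matching the id in the front matter. The function names are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def classify_link(target: str) -> str:
    """Return 'external', 'internal' or 'other' for a link target."""
    parts = urlsplit(target)
    if parts.scheme or parts.netloc:
        return "external"   # e.g. https://..., mailto:...
    if parts.path.endswith(".md"):
        return "internal"   # relative link to another markdown file
    return "other"          # e.g. '#anchor' or a scheme-less host name


def to_internal_target(target: str) -> str:
    """Map 'name.md' to '#name' and 'name.md#sec' to '#name-sec'."""
    if classify_link(target) != "internal":
        return target
    parts = urlsplit(target)
    # Use only the file name, so './name.md' and 'name.md' give the same anchor
    anchor = posixpath.basename(parts.path)[: -len(".md")]
    if parts.fragment:
        anchor += "-" + parts.fragment
    return "#" + anchor


print(to_internal_target("0001-file-a.md"))         # #0001-file-a
print(classify_link("https://stackoverflow.com/"))  # external
```

Note that a scheme-less target such as www.stackoverflow.com is classified as "other" here and left unchanged.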
