Can I generate a references index/ book index from markdown to html (ideally in a static site)?

Question

For an academic project, I would like to make an index: you know, that boring list of words that tells you on which pages each listed word appears. Like this: https://www.pdfindexgenerator.com/what-is-a-book-index/. But for a website.

My goal is to generate HTML pages from, let's say, Markdown. I would love to do this with a static site, because the content won't change every day and it seems to me that I'd have to parse all the content anyway. Maybe the solution is just to use a wiki.

Here's how I would do it: you write a bunch of text in page.md and, inside this text, use a specific markup to identify a [word] that you want to see in your index. Then you mention this same [word] with the same markup in otherpage.md. The generator extracts all the marked words, makes a list, and generates a page with links to every reference to each marked word.

Word:

page.html
otherpage.html

A reference index. Yay.

What I want is like a simpler version of LaTeX's MakeIndex. Something closer to this https://wordpress.org/plugins/lexicographer/, but not for definitions, only for internal references. Pandoc does not seem to support indices, maybe because MakeIndex is very complex (but indices actually are, so that's fair) or just because it's made for page numbers and not HTML links.

So:

- I know indices are complicated stuff, and impossible to fully automate. My only goal here is to be able to tag the words as I write and have some computer help to make the listing at the end and render a neat HTML page with all the links, because this part is really boring (like MakeIndex does). But maybe even this part is impossible, and I'd be fine with that.
- If it's not impossible, is this already implemented somewhere? There are plenty of static site generators and wikis and stuff; maybe someone thought about it before me, as indices are academic stuff that has been used for CENTURIES. Maybe there's a plugin or a piece of software I just don't know about.
- I would appreciate even just pointers on where to go to do this, if it's doable. There is a start here, in "How to generate (book) indexes?", but it's too little for me to understand what to do next.

Thanks a lot <3

A quick solution would be to use the [taxonomy system](https://learn.getgrav.org/16/content/taxonomy) to manually add relevant words as tags to your pages. You could then write up some relatively simple Twig code in your reference index page that loops over all pages, reads their tags and links to those pages based on the tags (words). — domsson, Apr 12 '20 at 11:35

score 1 · Answer 1 · answered May 19 '20 at 13:43

With Pelican, you can use tags that way.

You can add the following to your index.html template to loop through the existing tags:

{% if tags %}
    {% for tag, articles in tags %}
        {{ tag }}:
        <ul>
            {% for article in articles %}
                <li><a href="{{ article.url }}">{{ article.title }}</a></li>
            {% endfor %}
        </ul>
    {% endfor %}
{% endif %}

This renders each tag followed by a list of links to the articles that carry it.

You can't directly tag your text the way you showed, though. You'll have to add a Tags line to the metadata header of each article:

Title: mytitle
Date: 2020-05-19
Tags: firsttag, othertag
...

You can add this to your index.html template or to tags.html, as you see fit.
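
To make the idea concrete outside of any particular generator, here is a minimal Python sketch of the same grouping: it reads a `Tags:` line from each page's metadata header and maps every tag to the pages that carry it. This is not Pelican's own code; the function names `parse_tags` and `build_index` are made up for illustration, and it assumes the header ends at the first blank line.

```python
from collections import defaultdict


def parse_tags(markdown_text):
    """Return the tags from a 'Tags:' line in the metadata header.

    Assumption: the header is the run of 'Key: value' lines at the top
    of the file and ends at the first blank line.
    """
    for line in markdown_text.splitlines():
        if not line.strip():
            break  # blank line: end of the metadata header
        key, _, value = line.partition(":")
        if key.strip().lower() == "tags":
            return [t.strip() for t in value.split(",") if t.strip()]
    return []


def build_index(pages):
    """Map each tag to the sorted list of page names that carry it."""
    index = defaultdict(list)
    for name, text in pages.items():
        for tag in parse_tags(text):
            index[tag].append(name)
    return {tag: sorted(names) for tag, names in sorted(index.items())}


# Two in-memory pages standing in for files on disk (hypothetical content).
pages = {
    "page.md": "Title: mytitle\nDate: 2020-05-19\nTags: firsttag, othertag\n\nSome text.",
    "otherpage.md": "Title: other\nTags: firsttag\n\nMore text.",
}
index = build_index(pages)
for tag, names in index.items():
    print(f"{tag}: {', '.join(names)}")
```

The resulting dictionary has the same shape as what the template loop above iterates over: one entry per tag, each with its list of pages.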

My understanding is that the index.html is automatically generated when you "build" your document. This means that any edits I make on the index.html file will disappear when I re-build. Is there a way to incorporate your suggestion in a more permanent way? — TCS, Aug 01 '22 at 13:42

score 0 · Answer 2 · answered Aug 27 '23 at 17:22

goal here is to be able to tag the words as I write and having some computer help to make the listing at the end and render a neat HTML page with all the links

The hxindex utility from the W3C html-xml-utils package does a decent job of creating a back-of-the-book index. Here's an example that uses pandoc to convert Markdown to HTML and hxindex to produce the HTML index. It is shown here in standard form^[1], with links (locators) underlined and links to the defining instance of a term in bold.

dolor, Topic B
- yad, Topic F, Topic E, Topic I
  - yyad, Topic E
inventor, Topic C
ipso
- -a, Topic C
- -am, Topic D
- -um, Topic E
nada
- see: nihil, Topic H
nihil, Topic H
- see also: nulla, Topic I
nulla
- see: nihil, Topic H
perspicio, Topic A

^[1] The index is a list of lists (<ul class="indexlist"><li>…<ul><li>… ) which can be styled with CSS.

In the Markdown files, the HTML markup for index entries looks like this; !! separates term levels and | separates multiple terms:

dolor : <dfn title="dolor">doloremque</dfn>
… yyad : <span class="index" title="dolor!!yad!!yyad">dolor</span>
ipso, -um : <dfn title="ipso!!-um">ipsum</dfn>
nihil a.o. : <span class="index" title="nihil|nulla!!see: nihil|nada!!see: nihil">nihil</span>

Note that

- the title attribute is used (abused) in the Markdown files; in the HTML output, hxindex replaces it with an id attribute
- the see also: nulla reference is defined at the target to get the link right (this is the only way to do it directly with hxindex, so doing it in a script instead could be tempting)
- there is no limit on the number of index subterm levels
- the hxindex man page lists several options not used in this example
- the hxref utility generates cross-references inside and between HTML files
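
To illustrate how the title syntax decomposes, here is a small Python sketch that splits a title on | into separate entries and each entry on !! into its term levels. It only mirrors the convention described above; it is not how hxindex itself is implemented, the function name `parse_index_title` is made up, and any escaping rules hxindex may have are ignored.

```python
def parse_index_title(title):
    """Split an index title into entries, each a list of term levels.

    '|' separates multiple entries; '!!' separates levels within one entry.
    """
    return [entry.split("!!") for entry in title.split("|")]


# The title used for "nihil" in Topic H of the example.
entries = parse_index_title("nihil|nulla!!see: nihil|nada!!see: nihil")
for levels in entries:
    # Indent each level by two spaces per depth, like a printed index.
    for depth, term in enumerate(levels):
        print("  " * depth + term)
```

So a single marked word in Topic H contributes three index entries: the main term nihil and the two "see" references under nulla and nada.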

Following is the makefile and Markdown source for the example, plus the generated index database file.

File: Makefile

# desc:
#   Use `hxindex` to build an HTML index for Markdown files.
# compat:
#   GNU make 4.3  pandoc 2.9.2  html-xml-utils 7.7
# ref:
#   https://en.wikipedia.org/wiki/Index_(publishing)
#   https://www.w3.org/Tools/HTML-XML-utils/man1/hxindex

SHELL := /bin/sh
.NOTPARALLEL :      ## must access $(indextsv) database serially
.DELETE_ON_ERROR :
cssfile := doc.css
indextsv := doc-x.tsv

pandocMeta ?= -M lang='la' -M pagetitle='Topics'
pandocFlags ?= --standalone -w html --css='$(cssfile)' $(pandocMeta)
hxindexFlags ?= -x -f -N
hxnormalizeFlags ?= -x -d

# $(call md2html[,out=$@[,in=$<[,pandocExtraFlags[,hxindexExtraFlags]]]])
define md2html =
    pandoc $(pandocFlags)$(if $3, $3) -- $(if $2,$2,$<) \
    | hxindex $(hxindexFlags)$(if $4, $4) \
    | hxnormalize $(hxnormalizeFlags) \
    | hxremove 'head>meta[name],head>style' \
    > $(if $1,$1,$@)
endef


#:Single HTML file -- $(indextsv) not required
doc-a.html : $(patsubst %,doc-%.md,1 2 3 x) | $(cssfile) ; \
    $(call md2html,$@,$^,,)

#:Multiple HTML files -- index in doc-x.html
doc-x.html : $(indextsv) $(patsubst %,doc-%.html,1 2 3)
doc-%.html : doc-%.md $(indextsv) | $(cssfile); \
    $(call md2html,$@,$<,,-b $@ -i $(indextsv))

# Truncate to size zero
$(indextsv) : ; : > $@

# Minimal CSS
$(cssfile) : ; printf '%s\n' > $@ \
  'body{color:#111; background-color:#fffff8; margin:4em; font-family:serif;}' \
  'dfn{font-weight:bold; font-variant:small-caps;}' \
  'span[class~="index"]{text-decoration:underline;}'

#:Delete generated files
clean : ; rm -f -- doc-?.html $(indextsv) $(cssfile)
.PHONY : clean

File: doc-1.md

## Section 1

### Topic A

Sed ut <dfn title="perspicio">perspiciatis</dfn>, unde omnis iste 
natus error sit 

### Topic B

voluptatem accusantium <dfn title="dolor">doloremque</dfn> 
laudantium, totam rem aperiam

### Topic C

eaque <span class="index" title="ipso!!-a">ipsa</span>, quae ab 
illo <dfn id="inven…tor" title="inventor">inventore</dfn> veritatis 
et quasi architecto beatae vitae dicta sunt, explicabo.

File: doc-2.md

## Section 2

### Topic D

Nemo enim <span class="index" title="ipso!!-am">ipsam</span> voluptatem, 
quia voluptas sit, aspernatur aut odit aut 
fugit, sed quia consequuntur magni dolores eos, qui ratione voluptatem

### Topic E

sequi nesciunt, neque porro quisquam est, qui 
<span class="index" title="dolor!!yad">dolorem</span> 
<dfn title="ipso!!-um">ipsum</dfn>, quia 
<span class="index" title="dolor!!yad!!yyad">dolor</span>
sit amet consectetur adipisci velit, sed quia non numquam eius modi

### Topic F

tempora incidunt, ut labore et 
<span class="index" title="dolor!!yad">dolore</span> 
magnam aliquam quaerat voluptatem.

File: doc-3.md

## Section 3

### Topic G

Ut enim ad minima veniam, quis nostrum exercitationem ullam corporis 
suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur? 

### Topic H

Quis autem vel eum iure reprehenderit, qui in ea voluptate velit esse, 
quam <span class="index" title="nihil|nulla!!see: nihil|nada!!see: nihil"
>nihil</span> molestiae consequatur, 

### Topic I

vel illum, qui <span class="index" title="dolor!!yad">dolorem</span> 
eum fugiat, quo voluptas <span class="index" title="nihil!!see also: nulla">nulla</span> pariatur?

File: doc-x.md (index placeholder)

## Index

<!--index-->

File: doc-x.tsv (the hxindex database as output by expand -t 26,31,58)

dolor!!yad                1    doc-2.html#dolore          # Topic F Topics
dolor                     2    doc-1.html#doloremque      # Topic B Topics
dolor!!yad!!yyad          1    doc-2.html#dolor           # Topic E Topics
dolor!!yad                1    doc-3.html#dolorem         # Topic I Topics
dolor!!yad                1    doc-2.html#dolorem         # Topic E Topics
ipso!!-a                  1    doc-1.html#ipsa            # Topic C Topics
ipso!!-um                 2    doc-2.html#ipsum           # Topic E Topics
nada!!see: nihil          1    doc-3.html#nihil           # Topic H Topics
nihil!!see also: nulla    1    doc-3.html#nulla           # Topic I Topics
perspicio                 2    doc-1.html#perspiciatis    # Topic A Topics
nulla!!see: nihil         1    doc-3.html#nihil           # Topic H Topics
nihil                     1    doc-3.html#nihil           # Topic H Topics
ipso!!-am                 1    doc-2.html#ipsam           # Topic D Topics
inventor                  2    doc-1.html#inven…tor     # Topic C Topics

(The glitch in the last line is caused by the 3-byte Unicode ellipsis character U+2026 being counted as 3 characters.)
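
A quick way to see where the two missing columns come from is to compare the character count and the UTF-8 byte count of that anchor. This is just a sketch of the arithmetic in Python, not something the build needs:

```python
# The anchor from the last database line; U+2026 is the ellipsis character.
anchor = "doc-1.html#inven\u2026tor"

char_count = len(anchor)                  # what a reader sees as columns
byte_count = len(anchor.encode("utf-8"))  # what a byte-counting tool sees

print(char_count, byte_count, byte_count - char_count)
```

A tool that counts bytes instead of characters believes the field is two columns wider than it looks, so it pads with two fewer spaces.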
