
My problem is similar to this post, but not identical. I somehow can't figure out the correct pandoc command line parameters for maintaining/resolving cross-document links when using a couple of interlinked HTML files as the input.

Let's say I have two files, chapter1.xhtml and chapter2.xhtml located in the /home/user/Documents folder with the following contents:

<?xml version="1.0" encoding="utf-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<h3>Chapter 1</h3>
<p><a href="/home/user/Documents/chapter2.xhtml">Next chapter</a><br /></p>

<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
</body>
</html>

which contains a link to the next document.

and

<?xml version="1.0" encoding="utf-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<h3>Chapter 2</h3>
<p><a href="/home/user/Documents/chapter1.xhtml">Previous chapter</a><br /></p>

<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
</body>
</html>

which contains a link to the previous document.

I used the following command line parameters:

$ pandoc -s --toc --verbose -o /home/user/Documents/output.markdown /home/user/Documents/chapter1.xhtml /home/user/Documents/chapter2.xhtml

And I got the following output:

---
---

-   [Chapter 1](#chapter-1)
-   [Chapter 2](#chapter-2)

### Chapter 1

[Next chapter](/home/user/Documents/chapter2.xhtml)\

Lorem ipsum dolor sit amet, consectetur adipiscing elit.

### Chapter 2

[Previous chapter](/home/user/Documents/chapter1.xhtml)\

Lorem ipsum dolor sit amet, consectetur adipiscing elit.

This problem also occurs when I select docx or latex/pdf as the output format. I also tried to use relative links, but nothing worked.

What are the correct parameters for resolving cross-document links?

tl;dr: I don't want link references that contain the original paths; I want them to point to the corresponding location in the new output document.

Nemo XXX
  • so you want the links to point to the markdown files? I guess you'll have to write a [pandoc filter](http://pandoc.org/scripting.html) to change the links... – mb21 May 21 '16 at 09:44
  • Actually, I don't need to convert to markdown, I actually want to create docx/pdf files with working hyperlinks. I merely chose markdown output to illustrate a problem that occurs with **all** output formats. **pandoc** should be able to parse and resolve all hyperlinks and make them point to the output file **before** generating the output file. Generating files with broken links by default isn't very helpful. IMHO this is a major bug. – Nemo XXX May 21 '16 at 14:41
  • if you know the name of the generated output files, you can link to them without any problem, e.g. `file2.pdf#header-id` – mb21 May 21 '16 at 15:03
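
The filter idea from the first comment could be sketched roughly as follows. This is a minimal sketch rather than a tested solution: it walks pandoc's JSON AST with the Python standard library only, and it assumes that a `Link` node keeps its `[url, title]` pair as the last element of its `c` list. The `TARGETS` mapping and the function names are hypothetical; the header ids are the ones from the TOC output in the question.

```python
import json
import posixpath
import sys

# Hypothetical mapping: source file name -> id of the header that opens it
# in the merged document (ids taken from the TOC output in the question).
TARGETS = {
    "chapter1.xhtml": "#chapter-1",
    "chapter2.xhtml": "#chapter-2",
}


def rewrite_links(node, targets):
    """Recursively walk a pandoc JSON AST and retarget Link nodes in place."""
    if isinstance(node, list):
        for child in node:
            rewrite_links(child, targets)
    elif isinstance(node, dict):
        if node.get("t") == "Link":
            # Assumption: the last element of "c" is the [url, title] pair.
            target = node["c"][-1]
            path, _, fragment = target[0].partition("#")
            if posixpath.basename(path) in targets:
                # Keep an explicit fragment; otherwise jump to the chapter header.
                target[0] = "#" + fragment if fragment else targets[posixpath.basename(path)]
        for child in node.values():
            rewrite_links(child, targets)
    return node


def main():
    """Read a JSON AST from stdin and write the rewritten AST to stdout."""
    json.dump(rewrite_links(json.load(sys.stdin), TARGETS), sys.stdout)
```

Saved as, say, `relink.py` with a final `main()` call, it could sit between two pandoc runs: `pandoc -t json chapter1.xhtml chapter2.xhtml | python relink.py | pandoc -f json -o output.docx`. Whether the resulting internal links work in every output format is not something this sketch guarantees.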

1 Answer


The problem is that your links contain absolute paths (/home/user/Documents/chapter1.xhtml) instead of relative ones (chapter1.xhtml). I cannot imagine the ePUB file containing absolute paths, and if it does, the links in the file will only ever work correctly on your computer. So the solution will have to be fixing those ePUB files before feeding them to pandoc.
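
One way to do that fixing step, as a rough sketch: rewrite every absolute `href` that points into the documents folder as a relative one before calling pandoc. The function name `relativize_hrefs` is made up for this illustration, and the regex only handles double-quoted `href` attributes; a real XML parser would be more robust.

```python
import posixpath
import re

# Matches double-quoted href attributes only (an assumption of this sketch).
HREF = re.compile(r'href="([^"]*)"')


def relativize_hrefs(xhtml, base_dir):
    """Turn absolute hrefs that point below base_dir into relative ones."""
    prefix = base_dir.rstrip("/") + "/"

    def fix(match):
        url = match.group(1)
        if url.startswith(prefix):
            return 'href="%s"' % posixpath.relpath(url, base_dir)
        return match.group(0)  # external or already-relative links stay as-is

    return HREF.sub(fix, xhtml)
```

For the files in the question, `relativize_hrefs(text, "/home/user/Documents")` would turn the chapter links into `href="chapter2.xhtml"` and `href="chapter1.xhtml"`. This only makes the links portable; it does not by itself make pandoc resolve them across separate input files.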

Note that roundtripping with pandoc from markdown to epub and back to html works as expected:

$ pandoc -o foo.epub
# foo

adfs

# bar

go [to foo](#foo)


$ unzip foo.epub

$ cat ch002.xhtml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <meta http-equiv="Content-Style-Type" content="text/css" />
  <meta name="generator" content="pandoc" />
  <title>bar</title>
  <link rel="stylesheet" type="text/css" href="stylesheet.css" />
</head>
<body>
<div id="bar" class="section level1">
<h1>bar</h1>
<p>go <a href="ch001.xhtml#foo">to foo</a></p>
</div>
</body>
</html>

$ pandoc foo.epub

<p><span id="ch001.xhtml"></span></p>
<div id="ch001.xhtml#foo" class="section level1">
<h1>foo</h1>
<p>adfs</p>
</div>
<p><span id="ch002.xhtml"></span></p>
<div id="ch002.xhtml#bar" class="section level1">
<h1>bar</h1>
<p>go <a href="#ch001.xhtml#foo">to foo</a></p>
</div>

P.S.

Using two input documents like:

pandoc -o output.md chapter1.xhtml chapter2.xhtml

works as the pandoc README states:

If multiple input files are given, pandoc will concatenate them all (with blank lines between them) before parsing.

So for the parsing done by pandoc, it sees it as one document... so no wonder that cross-file links won't work.

mb21
  • Thanks for your feedback. However, your idea won't work because **href="#ch001.xhtml#foo"** is an invalid HTML URI. – Nemo XXX May 22 '16 at 18:26
  • Since both documents are in the same folder, I tested it without absolute path names, i.e. with relative links for `Previous chapter` and `Next chapter`, and I got the link format that you mentioned, which indeed appears to be valid in HTML5, but it's not valid in DOCX, ODT or LaTeX/PDF and will cause broken links. Unless I get a better suggestion, I'll simply switch to [Calibre](https://en.wikipedia.org/wiki/Calibre_%28software%29), which doesn't break links.
    – Nemo XXX May 23 '16 at 13:29
    I just generated a LaTeX-PDF from the ePUB I generated with pandoc (see in the answer) and the link worked: `pandoc foo.epub -o foo.pdf` – mb21 May 23 '16 at 14:43
  • Unfortunately, it didn't work for me. (BTW, I used XeTeX, so maybe this is a LaTeX configuration issue.) Can you please post the original ePub and the generated pdf file to a one-click file hoster/dropbox and indicate the exact command line parameters that you used? (Also, include any customized LaTeX templates or other non-standard files that were used to generate the output.) If I can reproduce your results, I'll immediately close the question and award you the 50 bounty points. – Nemo XXX May 23 '16 at 17:50
  • [here all three files](https://www.dropbox.com/sh/vbcsuh0rs4kd37c/AADXYAASHarC8nMeH-R7MkfKa). I used `pandoc -o foo.epub foo.md` to generate the epub, then `pandoc --latex-engine xelatex -o foo.pdf foo.epub` to generate the pdf. Opening the pdf e.g. in Chrome, scrolling down and clicking the link results in scrolling back up. – mb21 May 23 '16 at 21:00
  • using `pandoc 1.17.0.2` and `XeTeX 3.14159265-2.6-0.99992 (TeX Live 2015)` – mb21 May 23 '16 at 21:06
  • I was able to reproduce your results, but the result is misleading, because all links will link back to the beginning of the document. If you implement other cross-document links, you'll see that they won't work. You can test this with this [test document](https://www.dropbox.com/s/u11iwp021sqtslb/linktest.epub?dl=0). If you convert it to .pdf with the same pandoc parameters that you used, you'll see that only the TOC links work. Apparently pandoc can only handle TOC links. Since you've helped me to better understand how pandoc works, you'll get the promised points. – Nemo XXX May 24 '16 at 11:33
  • Ah, I see... this is indeed a bug in pandoc. I just submitted a [pull request](https://github.com/jgm/pandoc/pull/2942) that starts to address the problem... – mb21 May 24 '16 at 15:47
  • Thanks! I really appreciate it that you submitted a pull request. – Nemo XXX May 24 '16 at 17:52
  • sure, btw. pandoc only has troubles when you put `id`s on links and link **to** them... if in your epub you put all your `id`s on `span`s or headings, all should be fine.. – mb21 May 25 '16 at 07:59
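
To find the spots that last comment warns about, a small check along these lines might help. It is only a sketch using Python's standard `html.parser`; the class and function names are invented for this example. It lists every `id` that sits on an `<a>` element, so those ids can be moved to a `span` or a heading by hand.

```python
from html.parser import HTMLParser


class AnchorIdFinder(HTMLParser):
    """Collects the id attributes found on <a> elements."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <a id="x" />.
        if tag == "a":
            for name, value in attrs:
                if name == "id" and value:
                    self.ids.append(value)


def ids_on_links(xhtml):
    """Return the ids placed on <a> elements, in document order."""
    finder = AnchorIdFinder()
    finder.feed(xhtml)
    finder.close()
    return finder.ids
```

Running `ids_on_links` over each chapter file of an unzipped epub gives a list of candidate ids to relocate; an empty list means no link carries an id.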