I am trying to convert a Markdown file to PDF using pandoc on a Windows system. Since my Markdown contains Chinese characters, I use the following command to produce the PDF:
pandoc --pdf-engine=xelatex -V CJKmainfont=KaiTi test.md -o test.pdf
But pandoc complains that the file contains non-UTF-8 characters that it cannot handle. The exact error message is:
Error producing PDF.
! Undefined control sequence.
pandoc.exe: Cannot decode byte '\xae': >Data.Text.Internal.Encoding.streamDecodeUtf8With: Invalid UTF-8 stream
From what I have found on the internet, this is largely due to the encoding of the Markdown file and may have nothing to do with pandoc. My file contains a mix of Chinese and English characters, and I have converted it to UTF-8 encoding.
Things I have tried without success
Grep for non-UTF-8 characters
Following the instructions here and here, I have verified that the system locale is set to UTF-8. The output of localectl status is:
System Locale: LANG=en_US.UTF-8
VC Keymap: us
X11 Layout: us
I then tried to grep for non-UTF-8 characters with the command grep -axv '.*' test.md, but it produced no output. (I took that to mean the file contains no bytes that cannot be decoded as UTF-8.)
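As a cross-check on the grep result, a check along the following lines could report exactly where a file fails UTF-8 decoding. This is only a minimal sketch: the function name find_invalid_utf8 is hypothetical, and it assumes the file is small enough to read into memory in one go.

```python
def find_invalid_utf8(data):
    """Return a list of (byte_offset, line_number, byte_value) tuples,
    one for each byte at which strict UTF-8 decoding fails."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            # Strict decoding raises at the first undecodable byte.
            data[pos:].decode("utf-8")
            break  # the rest of the data is valid UTF-8
        except UnicodeDecodeError as err:
            bad = pos + err.start  # absolute offset of the offending byte
            line_no = data.count(b"\n", 0, bad) + 1  # 1-based line number
            problems.append((bad, line_no, data[bad]))
            pos = bad + 1  # resume scanning just after the bad byte
    return problems
```

Reading the file in binary mode, for example with open("test.md", "rb").read(), and passing the bytes to this function would give an empty list if the whole file is valid UTF-8, which would agree with grep printing nothing.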
Try to discard invalid characters
I followed the instructions here to try to remove non-UTF-8 characters from my file. The command I used is:
iconv -f utf-8 -t utf-8 -c test.md > output.md
After that, when I tried to convert output.md to PDF using pandoc, I still got the same error message, which suggests that the file still contains non-UTF-8 characters.
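For comparison, the effect I expected from iconv -c can be approximated in Python, together with a check of whether the result really is valid UTF-8. This is a rough sketch, not a description of what iconv does internally; the helper names strip_invalid_utf8 and is_valid_utf8 are hypothetical.

```python
def strip_invalid_utf8(data):
    """Drop every byte sequence that is not valid UTF-8 and return
    the remaining bytes re-encoded as UTF-8."""
    return data.decode("utf-8", errors="ignore").encode("utf-8")


def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Running is_valid_utf8 on the bytes of output.md would show whether the cleaned file itself still contains undecodable bytes.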
My question
How can I pinpoint which part of the file is causing the problem, or how can I really remove the non-UTF-8 characters from the file so that I can compile it without error?
Other information
You can find the markdown file here.
If you are using a Linux system, you may need to set CJKmainfont to another valid Chinese font name on your system.