I am using pdf2htmlEX
to convert PDF files to HTML. Afterwards, I also extract the text from the resulting HTML file.
The Problem:
I ran into a file whose text is unreadable in the converted HTML: https://dspace.mit.edu/openaccess-disseminate/1721.1/101159
The command I use:
pdf2htmlEX --tounicode 1 ./file.pdf
The text in the HTML contains many spurious spaces and quote characters:
[2]"M."Ha h n ,"O ."B ar bie ri,"F.P ."C a m p a na ,"R ."K öt z,"R ."G alla y,"A p p l."Ph ys ."A :"M a te r."S ci."P ro ce ss."8 2 "(2 00 6 )"
Setting other values for the --tounicode
argument turns the text into gibberish.
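A minimal sketch of how the three documented --tounicode values (-1 = ignore, 0 = auto, 1 = force) could be tried side by side on the same file. The helper name build_command and the output file names are made up for illustration; the script only prints the commands, so running them (for example with subprocess.run) is left as a separate step.

```python
from pathlib import Path

# Labels follow the pdf2htmlEX help text for --tounicode.
TOUNICODE_MODES = {-1: "ignore", 0: "auto", 1: "force"}


def build_command(pdf_path, tounicode):
    """Build the pdf2htmlEX argument list for one --tounicode value.

    The output file name is a hypothetical naming scheme so that the
    results of the three runs do not overwrite each other.
    """
    out_name = f"{Path(pdf_path).stem}_{TOUNICODE_MODES[tounicode]}.html"
    return ["pdf2htmlEX", "--tounicode", str(tounicode), str(pdf_path), out_name]


# Dry run: print each command instead of executing it.
for value in TOUNICODE_MODES:
    print(" ".join(build_command("./file.pdf", value)))
```

Each printed line can be pasted into a shell, and the resulting HTML files compared to see which mode, if any, gives readable text.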
An online tool that uses this library produces perfectly good HTML, which suggests this is not a pdf2htmlEX bug but a configuration or version problem. It may be related to poppler or fontforge.
Versions:
pdf2htmlEX version 0.14.6
Copyright 2012-2015 Lu Wang <coolwanglu@gmail.com> and other contributors
Libraries:
poppler 0.54.0
libfontforge 20180906
cairo 1.14.6
Default data-dir: /usr/local/share/pdf2htmlEX
Supported image format: png jpg svg
I also tried the new repository that maintains this project and got the same results; see issue: https://github.com/pdf2htmlEX/pdf2htmlEX/issues/92
Note that pdf2htmlEX uses a wide range of characters in place of spaces, such as " ' ( ) +, so replacing them all is not an option.
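To illustrate why blanket replacement does not work, here is a small sketch that applies such a cleanup to part of the output shown above. The function name naive_cleanup and the FILLERS set are assumptions made for this example: it drops the real spaces and then treats every one of the listed characters as a space.

```python
# Characters that appear to stand in for spaces in the converted HTML
# (taken from the list above; the real set may be larger).
FILLERS = "\"'()+"


def naive_cleanup(text):
    """Remove the spurious spaces, then turn every filler into a space."""
    without_spaces = text.replace(" ", "")
    table = str.maketrans({ch: " " for ch in FILLERS})
    # split/join collapses the runs of spaces the translation leaves behind.
    return " ".join(without_spaces.translate(table).split())


sample = '"R ."G alla y,"A p p l."Ph ys ."A :"M a te r."S ci."P ro ce ss."8 2 "(2 00 6 )"'
print(naive_cleanup(sample))
```

The words come out readable, but the parentheses around the year are gone, because ( and ) are legitimate characters in the original text as well as fillers.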
Is there any way to stop pdf2htmlEX from using these characters?