5

I am using pdf2htmlEX to convert PDF files to HTML. Afterwards I also extract the text from the resulting HTML file.

The Problem:

I came across a file whose text is unreadable in the converted HTML: https://dspace.mit.edu/openaccess-disseminate/1721.1/101159

The command I use:

pdf2htmlEX --tounicode 1 ./file.pdf

The text in the HTML contains many stray spaces and quotes:

[2]"M."Ha h n ,"O ."B ar bie ri,"F.P ."C a m p a na ,"R ."K öt z,"R ."G alla y,"A p p l."Ph ys ."A :"M a te r."S ci."P ro ce ss."8 2 "(2 00 6 )"

Setting other values for the --tounicode argument turns the text into gibberish.

There is an online tool that uses this library, and the HTML it produces is fine. That suggests this is not a pdf2htmlEX bug but a configuration or version problem, possibly related to poppler or fontforge.

Versions:

pdf2htmlEX version 0.14.6
Copyright 2012-2015 Lu Wang <coolwanglu@gmail.com> and other contributors
Libraries: 
  poppler 0.54.0
  libfontforge 20180906
  cairo 1.14.6
Default data-dir: /usr/local/share/pdf2htmlEX
Supported image format: png jpg svg

I also tried the new repository that maintains this project and got the same results; see the issue: https://github.com/pdf2htmlEX/pdf2htmlEX/issues/92

Note that pdf2htmlEX uses a wide range of characters in place of spaces, such as " ' ( ) +, so replacing them all is not an option.

Is there any way to make pdf2htmlEX not use these characters?

Montoya
  • The page itself looks fine when viewed in a browser, because the CSS modifies the letter spacing. Are you looking for raw HTML with proper sentences rather than the rendered version? – karthick Sep 14 '18 at 22:20
  • Yes. The output looks good in the browser; it's the raw HTML that is problematic for me. – Montoya Sep 15 '18 at 23:09

1 Answer

-1

I think the following two steps will work:

  1. Remove the unnecessary spaces and quotes with a regular expression.
  2. Wrap each reference in a paragraph tag, as below:
<div>
::before
<p>[2] something </p>
::after
</div>
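
As a rough illustration of step 1, here is a minimal sketch that works on the single sample line quoted in the question. It assumes that, in that line, every space is a kerning artifact and every " stands in for a real word gap. The helper name collapse_filler is hypothetical, and the approach would also destroy genuine quotes and does nothing about the other filler characters (' ( ) +) mentioned in the question.

```python
import re


def collapse_filler(line: str, filler: str = '"') -> str:
    """Hypothetical helper: treat `filler` as the real word separator.

    Assumption (true for the sample line only): every space in `line`
    is a spurious kerning gap, and every `filler` character marks a
    genuine space between words.
    """
    # Drop the spurious spaces that split words into fragments.
    without_kerning = line.replace(" ", "")
    # Turn each filler character into a real space.
    spaced = without_kerning.replace(filler, " ")
    # Collapse runs of spaces and trim the ends.
    return re.sub(r" {2,}", " ", spaced).strip()


sample = (
    '[2]"M."Ha h n ,"O ."B ar bie ri,"F.P ."C a m p a na ,"R ."K öt z,'
    '"R ."G alla y,"A p p l."Ph ys ."A :"M a te r."S ci."P ro ce ss.'
    '"8 2 "(2 00 6 )"'
)

print(collapse_filler(sample))
# [2] M. Hahn, O. Barbieri, F.P. Campana, R. Kötz, R. Gallay, Appl. Phys. A: Mater. Sci. Process. 82 (2006)
```

This only shows what the regular-expression step would look like for one line; whether the same assumption holds for the rest of the document would need to be checked.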
  • It won't work, for two main reasons: 1. These characters are used in the embedded font, and changing them will cause unknown behaviour. 2. Removing all quotes from the document will also remove the real ones, and quotes aren't the only kind of character it may insert into the raw HTML. – Montoya Sep 17 '18 at 04:53