Alternatives to pdftohtml

Question

I'm experimenting with pdftohtml but I'm finding that it's occasionally having difficulty parsing tables correctly. It's grouping the text from two columns into a single cell, which makes my attempts to parse the resulting data futile!

Note that this occurs only once or twice within a PDF and is quite unpredictable.

I've tried the latest versions of pdftohtml (including the 0.40a beta), but to no avail.

Is anyone aware of any Linux-compatible equivalents that might be worth trying?

Thanks,

Sam

Have you submitted a bug report? PDFs are notoriously difficult to parse, and an incredible amount of time has gone into the poppler tools. Your best bet might be to see what you can do to help upstream. — efrey, May 15 '12 at 14:11

score 1 · Answer 1 · answered Jan 29 '15 at 11:19

pdf2htmlEX is the best PDF-to-HTML converter I've seen.

Install (with Homebrew): brew install pdf2htmlex

I had to force the install with brew install -f pdf2htmlex

Run example: pdf2htmlEX --embed cfijo --dest-dir 'your-directory' your.pdf

That creates a new directory containing the .html file and the images it references.
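
If you need to run this over many PDFs, the same command can be wrapped in a script. The following is only a minimal sketch: the helper names build_command and convert are hypothetical, and the flags are simply the ones from the command above, so check them against your installed version.

```python
import shutil
import subprocess


def build_command(pdf_path, dest_dir):
    """Return the argument list for the pdf2htmlEX call quoted above."""
    # Same flags as the command-line example: --embed cfijo and --dest-dir.
    return ["pdf2htmlEX", "--embed", "cfijo", "--dest-dir", dest_dir, pdf_path]


def convert(pdf_path, dest_dir):
    """Run pdf2htmlEX if it is on PATH; return True when the command was run."""
    if shutil.which("pdf2htmlEX") is None:
        # Tool not installed, so there is nothing to run.
        return False
    # check=True raises CalledProcessError if the converter exits non-zero.
    subprocess.run(build_command(pdf_path, dest_dir), check=True)
    return True


if __name__ == "__main__":
    # 'your.pdf' and 'your-directory' are the placeholders from the example above.
    print(build_command("your.pdf", "your-directory"))
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.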
