PDF to HTML via Google?

Asked May 01 '13 at 13:48

Active May 01 '13 at 19:19

I have been trying for a long time to get the IIHF PDFs (example: http://stats.iihf.com/Hydra/349/IHM349131_74_3_0.pdf) into a parseable form.

I have finally managed it, because Google's cache stores an HTML version of the file (http://webcache.googleusercontent.com/search?q=cache:http://stats.iihf.com/Hydra/349/IHM349131_74_3_0.pdf), and that version can be parsed easily.

The only problem is that Google doesn't cache every PDF it has indexed, and even when it does cache a file, it can take days for the cached copy to appear.
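
For reference, the cached link above appears to be just a fixed prefix followed by the PDF's own URL. Here is a minimal sketch of building such links, written in Python for illustration only; the function name `google_cache_url` is my own, and the prefix is simply copied from the example link, so treat the pattern as an assumption rather than a documented API:

```python
# Prefix copied from the cached link quoted in the question (an observed
# pattern, not a documented Google API).
CACHE_PREFIX = "http://webcache.googleusercontent.com/search?q=cache:"


def google_cache_url(pdf_url: str) -> str:
    """Return the cache lookup URL for a PDF, mirroring the link format above.

    This only builds the string; it does not check whether Google has
    actually cached the file.
    """
    return CACHE_PREFIX + pdf_url.strip()


if __name__ == "__main__":
    example = "http://stats.iihf.com/Hydra/349/IHM349131_74_3_0.pdf"
    print(google_cache_url(example))
```

Building the link is the easy part; whether the cached copy exists yet is exactly what I cannot control.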

Is there any way to get those HTML versions through an API, or even manually?

Edit: I forgot to mention that these PDFs have somehow corrupted character maps, so ordinary PDF-to-HTML converters can't convert them.

edited May 01 '13 at 19:19

Nelson

asked May 01 '13 at 13:48

Miika Arponen

What language are you using? If you're using Java, you can use http://pdfbox.apache.org/, which lets you extract content from existing PDFs. – Nelson May 01 '13 at 13:51
I am using PHP. I have also added one more small problem to the original question. – Miika Arponen May 01 '13 at 16:59
Try this page: http://stackoverflow.com/questions/956508/convert-pdf-to-html – Topological Sort Feb 04 '15 at 19:50

0 Answers