I have been trying for a long time to get the IIHF PDFs (example here: http://stats.iihf.com/Hydra/349/IHM349131_74_3_0.pdf) into a parseable form.
Now I've finally managed it, because Google's cache stores an HTML version of the file (http://webcache.googleusercontent.com/search?q=cache:http://stats.iihf.com/Hydra/349/IHM349131_74_3_0.pdf), which can be parsed easily.
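The cache URL is just a fixed prefix followed by the original PDF URL. A minimal sketch in Python, assuming that pattern holds for the other files too (the names `CACHE_PREFIX` and `cache_url` are only illustrative, not part of any API, and the function only builds the string without fetching anything):

```python
# Prefix taken from the cache link above; assumed to be the same for every file.
CACHE_PREFIX = "http://webcache.googleusercontent.com/search?q=cache:"


def cache_url(pdf_url: str) -> str:
    """Build the Google cache URL for a PDF URL (hypothetical helper, no request is made)."""
    return CACHE_PREFIX + pdf_url


if __name__ == "__main__":
    print(cache_url("http://stats.iihf.com/Hydra/349/IHM349131_74_3_0.pdf"))
```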
The only problem is that Google doesn't cache every one of these PDFs, and even when it does cache a file, it can take days to appear there.
Is there any way to get those HTML versions via any API or even manually?
Edit: I forgot to mention that these PDFs have somehow corrupted character maps, so normal PDF-to-HTML converters can't convert them.