I'm trying to scrape a list of physician names from a PDF. The file appears to be in mixed encoding.
When I copy/paste a single physician's name (page 51), I get this:
Dandona, Suklesh
If I paste just the gibberish part to a text file and try enca, I get:
enca -L none CHC_test.txt
Universal transformation format 8 bits; UTF-8
Which ain't it.
The wrinkle here that makes this not a duplicate of previous questions is that if I just view the file in a PDF viewer I can see the address. It's (typing it out): 1601 Main St Suite 306
So how do I convert the addresses in this file? enca doesn't seem to take known text strings. I guess I could run every single supported encoding through iconv programmatically and see if the result equals what I have typed out above. Since R has an iconv interface I might do just that, but perhaps someone has a better solution?
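To make the brute-force idea concrete, here is a minimal sketch of what I have in mind, written in Python rather than R. The function names are hypothetical. The candidate list comes from Python's codec alias table, which is not the same list iconv supports. The demo bytes are synthetic, not taken from the PDF.

```python
import encodings.aliases


def candidate_codecs():
    """Return the sorted, de-duplicated codec names from Python's alias table."""
    return sorted(set(encodings.aliases.aliases.values()))


def find_codecs(raw, expected):
    """Return every codec under which the bytes `raw` decode to exactly `expected`."""
    matches = []
    for name in candidate_codecs():
        try:
            decoded = raw.decode(name)
        except Exception:
            # Unknown on this platform, not a text encoding, or the bytes
            # are invalid under this codec: either way, not a candidate.
            continue
        if decoded == expected:
            matches.append(name)
    return matches


if __name__ == "__main__":
    known = "1601 Main St Suite 306"
    # Synthetic stand-in for the pasted bytes (assumption: the real ones
    # would be read from the file, e.g. open("CHC_test.txt", "rb").read()).
    raw = known.encode("cp037")
    print(find_codecs(raw, known))
```

More than one codec can match, since many encodings agree on letters and digits. An empty result would suggest the PDF uses a custom font mapping rather than any standard encoding.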
I'm aware of the usual caveats about encoding: there's no way to know for sure, Unicode is not an encoding, etc. I have read Joel, I promise. :-D