
I use a couple of different programs to convert PDF files to .txt files. Usually this results in good-looking text; sometimes it doesn't. I have a set of files that convert in the following way:

Text I can read: Your Account Summary

Copy and paste into Notepad++: a run of control characters (SO, SI, STX, ...)

Ghostscript: produces what seems to be a garbage file, full of xEF, xBF bytes.

xPdf: gives me a file full of stuff like this: Ç+6 3 É+C ÌÍÍÌ; ÆÁÅ ÅAÁ

It seems like the copy-paste method is the closest to English, because each of those control characters appears to stand for a letter of the alphabet: SO == Y, SI == o, STX == u, etc.

I would like to convert these PDF files to English text.
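
If the copy-pasted control characters really do stand for letters one-for-one, the conversion would amount to a substitution table. A minimal sketch of that idea in Python, using only the three pairs observed above (the table and example string are hypothetical, and the comments below explain why a complete table may not exist at all):

    # Hypothetical substitution table built from the pairs observed above.
    # Only SO, SI and STX are known; every other entry would have to be
    # discovered per file, and may not exist at all.
    control_to_letter = {
        "\x0e": "Y",  # SO  (Shift Out)      -> 'Y' (observed)
        "\x0f": "o",  # SI  (Shift In)       -> 'o' (observed)
        "\x02": "u",  # STX (Start of Text)  -> 'u' (observed)
    }

    def translate(garbled: str) -> str:
        """Replace known control characters; leave everything else untouched."""
        return "".join(control_to_letter.get(ch, ch) for ch in garbled)

    print(translate("\x0e\x0f\x02r"))  # -> "Your", if the guesses hold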

Ben Walker
  • This has been asked countless times on SO. Short answer: your file does not allow text extraction; use an OCR library instead. – yms Sep 10 '13 at 19:56
  • If the copy-paste method is actually some kind of representation of characters, though, I would assume that I could extract that code, and then convert it to real text. Am I incorrect? – Ben Walker Sep 10 '13 at 20:20
  • Not really... they could just be indexes into an array of objects that tell the PDF reader how to draw each character, without any further info on the text represented. Please look for questions about PDF text extraction on SO; there are many good answers here that cover these issues. – yms Sep 10 '13 at 20:28
  • 2
    Check this one for example: http://stackoverflow.com/questions/17193839/where-can-i-a-mapping-of-identity-h-encoded-characters-to-ascii-or-unicode-chara/17649484#17649484 – yms Sep 10 '13 at 20:59
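
The OCR fallback yms suggests could look roughly like this: render each page to an image and let an OCR engine read it. A sketch only, assuming the pdf2image and pytesseract packages (and the Poppler and Tesseract binaries they wrap) are installed; none of these are named in the thread itself:

    # Sketch of the OCR fallback: rasterise each PDF page, then OCR it.
    from pdf2image import convert_from_path
    import pytesseract

    def pdf_to_text_via_ocr(pdf_path: str, dpi: int = 300) -> str:
        pages = convert_from_path(pdf_path, dpi=dpi)  # one PIL image per page
        return "\n\f\n".join(pytesseract.image_to_string(page) for page in pages)

    print(pdf_to_text_via_ocr("statement.pdf"))  # hypothetical file name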

1 Answer


It is normal for Unicode-encoded output to contain byte sequences like xEF, xBF. You need an additional transformation from those Unicode bytes into user-friendly letters.
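
One way to see what those xEF, xBF bytes actually are is to decode the extracted file as UTF-8 and inspect the resulting code points. A minimal sketch, assuming the Ghostscript output is UTF-8 and lives in a hypothetical garbage.txt; note that EF BB BF is just the UTF-8 byte-order mark, while EF BF BD decodes to the replacement character U+FFFD, which would mean the original glyphs never mapped to real text:

    # Decode the extracted bytes as UTF-8 and count replacement characters.
    from collections import Counter

    with open("garbage.txt", "rb") as f:      # hypothetical file name
        raw = f.read()

    text = raw.decode("utf-8", errors="replace")
    counts = Counter(text)
    print("replacement characters:", counts["\ufffd"], "of", len(text))
    print(text[:200])  # readable letters, if decoding was all that was missing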

stanlyF