
There are PDFs (A) from which characters cannot be copied in a reader, and PDFs (B) whose characters can be copied but turn into unreadable gibberish when pasted into a text editor. ("Encryption" in this context doesn't mean password-protected.)

  1. How can these (A) and (B) types of PDFs be identified programmatically? Python is preferred.
  2. Is it possible to extract the text correctly from these files?
yuma4012
  • Welcome to Stack Overflow. Unfortunately, your question isn't very clear. Please read [ask]. Can you provide examples of both cases? – ChrisGPT was on strike Aug 27 '20 at 12:14
  • By A I think you mean the document permissions, and by B I think you mean that the file doesn't include a correct `ToUnicode` map (or your reader is ignoring it). Can you confirm? Note that this currently doesn't seem like a programming question: no libraries or code are referenced, hence I'm tempted to close. – Sam Mason Aug 27 '20 at 12:19
  • It's hard to understand your needs based on the limited information in your question. Please check out this post to see if it is useful to your needs: https://stackoverflow.com/questions/58226546/python-data-extraction-from-an-encrypted-pdf/58295892#58295892 – Life is complex Aug 27 '20 at 12:23
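
If the reading in the comments is right (type A is a permissions restriction, type B is a missing or wrong `ToUnicode` map), the two cases could be told apart roughly as sketched below. This is a minimal standard-library sketch, not a definitive method. It assumes you have already obtained the integer `/P` permission value from the file's encryption dictionary and the extracted text by some other means, such as a PDF library of your choice. The function names `copy_allowed` and `looks_garbled` and the `threshold` value are hypothetical and chosen for illustration only.

```python
import unicodedata

# In the PDF standard security handler, bit 5 (1-based) of the /P integer
# controls "copy or extract text and graphics". 1-based bit 5 has value 16.
COPY_BIT = 1 << 4


def copy_allowed(p_value: int) -> bool:
    """Type A check: does the /P permission integer allow copying text?

    /P is stored as a signed 32-bit integer, so it is usually negative.
    Python's & operator treats negative ints as two's complement, so the
    bit test works directly. Only meaningful for files that actually have
    an /Encrypt dictionary.
    """
    return bool(p_value & COPY_BIT)


def looks_garbled(text: str, threshold: float = 0.3) -> bool:
    """Type B heuristic: is a large share of the text unprintable junk?

    Counts private-use (Co), unassigned (Cn), and control (Cc) characters
    other than ordinary whitespace, plus U+FFFD replacement characters.
    The 0.3 threshold is an arbitrary assumption; tune it on real files.
    """
    if not text:
        return False
    suspicious = 0
    for ch in text:
        if ch in "\n\r\t":
            continue
        if ch == "\ufffd" or unicodedata.category(ch) in ("Co", "Cn", "Cc"):
            suspicious += 1
    return suspicious / len(text) >= threshold


if __name__ == "__main__":
    print(copy_allowed(-3904))            # a /P value that denies copying
    print(looks_garbled("Hello, world"))  # ordinary readable text
```

The heuristic in `looks_garbled` only catches text that maps to obviously invalid code points. A font whose glyphs map to the wrong but valid letters would pass unnoticed, and for such files OCR on the rendered pages is likely the only way to recover the text.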

0 Answers