
I have a PDF document with the following sample text (screenshot):

Screenshot of the sample text from the PDF document

But when I copy and paste it into Word or another text editor, all I see is weird characters:

    

I am not sure why pasting gives me weird square boxes instead of the clear, human-readable letters shown in the screenshot. Can someone help me get rid of this issue, or at least tell me how to identify its root cause?

mkl
Panchu
  • Apparently your PDF lacks the entries required for text extraction. Glyphs can be displayed without any hint about which Unicode code point each glyph represents as a character. – mkl Aug 15 '20 at 08:42
  • @mkl - If I understood correctly, this can't be fixed any more? – Panchu Aug 16 '20 at 03:18
  • Depending on the number of distinct font objects in the PDF, you may attempt to inject information in that regard; compare [this answer](https://stackoverflow.com/a/39644941/1729265). Another option is OCR... – mkl Aug 16 '20 at 08:26
  • Thanks for your suggestion, @mkl. I went with the OCR approach and it resolved my issue. – Panchu Aug 26 '20 at 15:45
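
To help identify the root cause, one rough check is to look at which code points the pasted text actually contains. The sketch below is only a heuristic, and the function names are made up for illustration: it assumes that text copied from a PDF without usable Unicode mapping tends to land in private-use, unassigned, or control code points, which editors render as boxes.

```python
import unicodedata

# Unicode general categories that usually mean "no real character here":
# Co = private use, Cn = unassigned, Cc = control, Cs = surrogate.
SUSPECT_CATEGORIES = {"Co", "Cn", "Cc", "Cs"}


def suspect_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that look unmapped."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    suspect = sum(
        1
        for c in chars
        # U+FFFD is the replacement character, also a sign of lost text.
        if c == "\ufffd" or unicodedata.category(c) in SUSPECT_CATEGORIES
    )
    return suspect / len(chars)


def describe(text: str) -> list[str]:
    """List each distinct non-whitespace character as 'U+XXXX category'."""
    distinct = sorted({c for c in text if not c.isspace()})
    return [f"U+{ord(c):04X} {unicodedata.category(c)}" for c in distinct]
```

Pasting the copied text into `suspect_ratio` and getting a value close to 1.0 would be consistent with the explanation in the comments above; a value near 0.0 would point to a different problem, such as a missing font in the target editor.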

1 Answer

================== Workaround found ==================

  • I tried converting the document's corrupted characters to a standard text encoding (ANSI/Unicode), but most of the online services couldn't recognize these garbage characters.
  • This issue could probably be resolved with some programming, but I didn't want to invest time in that and preferred a quick, on-the-fly approach.
  • Finally, as suggested by the user 'mkl', running the document through an OCR service such as "Sejda" or "Adobe OCR" resolved my issue.
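
For anyone who does want the programming route, the OCR step can be sketched roughly as follows. This is not what was used above: it swaps in the open-source Tesseract engine (via the `pytesseract` package) and `pdf2image` for page rendering, both of which are assumptions and must be installed separately along with the Tesseract and Poppler binaries. The function name `ocr_pdf` and its `render`/`recognize` parameters are hypothetical.

```python
from typing import Callable, Iterable, Optional


def ocr_pdf(
    pdf_path: str,
    render: Optional[Callable[[str], Iterable[object]]] = None,
    recognize: Optional[Callable[[object], str]] = None,
) -> str:
    """Rasterize each page of a PDF and OCR it, returning the combined text."""
    if render is None:
        # Assumption: pdf2image (which needs Poppler) is installed.
        from pdf2image import convert_from_path

        render = lambda path: convert_from_path(path, dpi=300)
    if recognize is None:
        # Assumption: pytesseract and the Tesseract binary are installed.
        import pytesseract

        recognize = lambda image: pytesseract.image_to_string(image, lang="eng")

    # OCR page by page; the broken text layer of the PDF is never read.
    pages = [recognize(image).strip() for image in render(pdf_path)]
    return "\n\n".join(pages)
```

Calling `ocr_pdf("document.pdf")` with the default arguments would return the recognized text of all pages separated by blank lines. Because OCR works from the rendered page image, it sidesteps the missing text-extraction information entirely, at the cost of occasional recognition errors.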
Panchu