I have a PDF file from which I want to extract text, but I cannot because the fonts are missing their ToUnicode maps.

./pdffonts /Users/subhashlengare/Downloads/pqr39_abc.pdf
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
ATRTHG+TT1CABt00                     TrueType          yes yes no      23  0
VFQVYH+TT1CAEt00                     TrueType          yes yes no      19  0
ODNMDG+TT1CAFt00                     TrueType          yes yes no      31  0
DXGYRQ+TT1CB0t00                     TrueType          yes yes no      27  0
VFQVYH+TT1CB1t00                     TrueType          yes yes no       7  0
ArialMT                              TrueType          no  no  no     295  0
NXBBUP+TT1CC0t00                     TrueType          yes yes no      53  0
NXBBUP+TT1CC1t00                     TrueType          yes yes no      65  0
KDGXKF+TT1CC4t00                     TrueType          yes yes no     104  0
VRCBAT+TT1CC5t00                     TrueType          yes yes no     100  0
QTNBCJ+TT1CC2t00                     TrueType          yes yes no      88  0
NXBBUP+TT1CC6t00                     TrueType          yes yes no      96  0
NXBBUP+TT1CC7t00                     TrueType          yes yes no     116  0
NXBBUP+TT1CC8t00                     TrueType          yes yes no     128  0

How can I add the missing ToUnicode maps back so that text extraction works?

subhashlg26
  • See this answer for PDFBox: https://stackoverflow.com/questions/39485920/how-to-add-unicode-in-truetype0font-on-pdfbox-2-0-0. However, this is very tricky and has to be done for every font, so it would probably be faster to use OCR. – Tilman Hausherr May 29 '17 at 11:48
  • iText is currently looking into OCR tools and I'd like to investigate whether it's possible to support this use case for existing documents. It might take a while, but we'd like to come back to this. – blagae Jul 18 '17 at 11:51
  • I'm not very familiar with this particular library (iText), but two things might be useful: (1) read the [PDF specification](https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf) to understand where to put the map, in case the library doesn't already support it; (2) look for a library that extracts the map from the TrueType font file, or OCR individual glyphs if the map in the font file itself is wrong. – user202729 Jan 17 '21 at 06:24
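
To make the comments above a bit more concrete, here is a minimal sketch of what a ToUnicode map looks like once the code-to-character mapping is known. It only builds the CMap text for a simple (single-byte) TrueType font, using the `bfchar` layout described in the PDF specification. The function name `build_tounicode_cmap` and the sample mapping are hypothetical; the real mapping would have to be recovered per font (from the embedded font's own tables, by OCR, or by hand), and attaching the result to each font dictionary as its `/ToUnicode` stream would still need a PDF library such as iText or PDFBox, which is not shown here.

```python
def build_tounicode_cmap(code_to_text):
    """Build ToUnicode CMap text for single-byte character codes.

    code_to_text maps a character code (0..255) to the Unicode string it
    should extract as. This mapping is an assumption supplied by the
    caller; nothing here can discover it from the PDF.
    """
    entries = []
    for code in sorted(code_to_text):
        if not 0 <= code <= 0xFF:
            raise ValueError(f"code {code} is outside the single-byte range")
        # Destination strings are written as UTF-16BE hex.
        utf16 = code_to_text[code].encode("utf-16-be").hex().upper()
        entries.append(f"<{code:02X}> <{utf16}>")

    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<00> <FF>",
        "endcodespacerange",
    ]
    # Keep each bfchar block to at most 100 entries.
    for start in range(0, len(entries), 100):
        chunk = entries[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        lines.extend(chunk)
        lines.append("endbfchar")
    lines += [
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end",
        "end",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical mapping: code 0x01 is "A", code 0x02 is an "fi" ligature.
    print(build_tounicode_cmap({0x01: "A", 0x02: "fi"}))
```

Since every subset font in the `pdffonts` listing has its own encoding, a separate mapping (and a separate CMap) would be needed for each one, which is why the first comment suggests OCR may be faster.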

0 Answers