
I am trying to convert a PostScript file that contains text in a Telugu font (Vani Bold) to PDF. After the conversion I cannot copy the text from the generated PDF file. When I view the PDF's properties in the CentOS document viewer, it shows the following: [screenshot]

I am using the command below to convert the PostScript file to PDF:

bin/gs -dBATCH -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -sOutputFile=/home/cloudera/Desktop/PrintTest/telugu.pdf /home/cloudera/Desktop/PrintTest/VirtualPrinter_27_09_2016_19_11_41_691.ps

I tried both Ghostscript 9.19 and 9.20, with no change.

The PostScript file I am trying to convert is available here: (link to PostScript file)

I have been struggling with this for 10 days. Please suggest a solution.

prasad

1 Answer


I can tell you why you can't copy & paste the text, but I'm not sure I can provide an acceptable solution.

First, not all PDF viewers can deal with Unicode characters (for example, xpdf can't and simply ignores them, while mupdf and qpdfview work).

Second, for a viewer to convert font glyphs back to Unicode characters, the font object in the PDF file must contain a /ToUnicode entry. If you look at the generated PDF after decompression (mutool clean -d), you can see that the Vani font in object 8 0 doesn't have one, while both the Arial font in object 10 0 and the Calibri font in object 12 0 do.

So the Vani font is very likely missing this Unicode translation information. You need to either add it (e.g. with FontForge) or choose a different font that already has it.
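To repeat the check described above on other files, something like the following rough sketch might help. It assumes the PDF has already been decompressed (e.g. with mutool clean -d) so that the font dictionaries are plain text, and it just scans for font objects with a regular expression rather than parsing the PDF properly. The function name is made up for illustration, and descendant CID fonts are skipped because /ToUnicode lives on their Type0 parent.

```python
import re

# Heuristic patterns for a *decompressed* PDF (assumption: objects are plain text,
# not packed into object streams).
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
FONT_RE = re.compile(rb"/Type\s*/Font\b")          # does not match /FontDescriptor
CIDFONT_RE = re.compile(rb"/Subtype\s*/CIDFontType[02]\b")
BASEFONT_RE = re.compile(rb"/BaseFont\s*/([^\s/<>\[\]()]+)")


def list_fonts_with_tounicode(pdf_bytes):
    """Return (object number, base font name, has /ToUnicode) per font object."""
    report = []
    for match in OBJ_RE.finditer(pdf_bytes):
        body = match.group(3)
        if not FONT_RE.search(body):
            continue
        # Descendant CID fonts never carry /ToUnicode themselves; skip them.
        if CIDFONT_RE.search(body):
            continue
        name = BASEFONT_RE.search(body)
        report.append((
            int(match.group(1)),
            name.group(1).decode("latin-1") if name else "(unnamed)",
            b"/ToUnicode" in body,
        ))
    return report


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], "rb") as handle:
        for number, font, ok in list_fonts_with_tounicode(handle.read()):
            print(number, font, "has /ToUnicode" if ok else "NO /ToUnicode")
```

Any font reported without /ToUnicode is a candidate for the copy & paste problem. This is only a quick diagnostic; a real PDF library would be more robust.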

dirkt
  • Here's a screenshot of PDFDebugger: http://imgur.com/a/0w043 . Feel free to include it. It shows that ToUnicode is missing, and that the unicode column is empty. – Tilman Hausherr Sep 28 '16 at 16:40
  • Thank you so much for your response @dirkt. How can I add the Unicode translation information to the Vani font? Please point me to some reference so that I can add it and try. – prasad Sep 29 '16 at 06:04
  • I've never designed a font, so I don't know specifics. I know there are programs like [Fontforge](https://fontforge.github.io/en-US/) you can use to edit fonts, so I'd suggest you take a look at both your Vani and Arial-Bold fonts using it, try to figure out which table encodes the unicode translation, and add it. Also read the Fontforge documentation. – dirkt Sep 29 '16 at 06:16
  • Also, the fonts are embedded in your PostScript file, so even if you manage to fix your font, you still won't be able to convert this particular PostScript file into a PDF you can copy/paste from. You'd need to regenerate the PostScript file with the new font, or somehow replace the embedded font in the PostScript file with the new one. – dirkt Sep 29 '16 at 06:20
  • Hi @Tilman Hausherr, where should I include it? I did not understand what you suggested I do. Can you please explain it once more? It would be a great help. – prasad Sep 29 '16 at 07:22
  • @prasad my remark was directed to dirkt, to include the screenshot in his answer. It's just a visualization of the problem in your PDF. – Tilman Hausherr Sep 29 '16 at 07:26
  • @prasad it is possible to add toUnicode information, see https://stackoverflow.com/questions/39485920/how-to-add-unicode-in-truetype0font-on-pdfbox-2-0-0/39644941#39644941 however this is incredibly difficult and a lot of work. It's easier to start with a correct font. – Tilman Hausherr Sep 29 '16 at 07:27
  • Hi @Tilman Hausherr, is there any alternative solution to my problem? Or how can I change the font? My PostScript file is generated by VirtualPrinter. – prasad Sep 29 '16 at 07:32
  • @TilmanHausherr and prasad: If the goal is to change that single pdf, it might be easier to just use a text editor and add a CMAP in the decompressed pdf and then fix the xref, instead of writing a program for it. But that would solve the problem only for that particular pdf. – dirkt Sep 29 '16 at 07:43
  • prasad: Do you want a solution for that particular pdf, or for all pdf's you are going to produce in the future? And as I already said, an alternative solution to *editing* the font is to *find* another font that already has the correct unicode information. – dirkt Sep 29 '16 at 07:45
  • Hi @dirkt, I need a solution for future documents as well, and for any other fonts that cause this copy problem, so please give me a general solution. I am really struggling with this. – prasad Sep 29 '16 at 07:53
  • Hi @dirkt, what can I do to replace the font with another font? And please let me know whether the problem is only in the PostScript file or elsewhere. – prasad Sep 29 '16 at 07:56
  • Hi @TilmanHausherr, is it possible to convert that PostScript file to another PostScript file, substituting correct fonts that provide the /ToUnicode information? – prasad Sep 29 '16 at 11:28
  • @prasad I'm not a PostScript expert, sorry. Btw put at least one space before "@", without it people don't get an alert. – Tilman Hausherr Sep 29 '16 at 13:40
  • Hi @TilmanHausherr, where can I get information about /ToUnicode? – prasad Sep 29 '16 at 14:47
  • @see https://stackoverflow.com/questions/39485920/how-to-add-unicode-in-truetype0font-on-pdfbox-2-0-0/39644941#39644941 and in "9.10 Extraction of Text Content" in the PDF 32000 specification https://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf – Tilman Hausherr Sep 30 '16 at 09:45
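
For the single-PDF route dirkt mentions (adding a CMap by hand to the decompressed PDF), the text of a ToUnicode CMap could be generated with a small script like the sketch below. It follows the general layout of ToUnicode CMaps described in section 9.10 of the PDF 32000 specification. The function name and the example code-to-Unicode mapping are purely hypothetical: the real mapping for the Vani glyphs has to be worked out by inspecting the font, and the result still has to be added to the PDF as a stream object, referenced from the font dictionary via /ToUnicode, with the xref fixed afterwards.

```python
def build_tounicode_cmap(mapping, code_bytes=1):
    """Build ToUnicode CMap text from {character code: Unicode string}.

    code_bytes is the width of a character code in the font's encoding
    (assumption: 1 for a simple font, 2 for a typical Identity-H CID font).
    """
    width = code_bytes * 2

    def hex_code(code):
        return f"<{code:0{width}X}>"

    def hex_text(text):
        # Destination strings are UTF-16BE, so one code may map to several characters.
        return "<" + text.encode("utf-16-be").hex().upper() + ">"

    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        f"{hex_code(0)} {hex_code(256 ** code_bytes - 1)}",
        "endcodespacerange",
    ]
    items = sorted(mapping.items())
    # A single bfchar section may hold at most 100 entries.
    for start in range(0, len(items), 100):
        chunk = items[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        lines.extend(f"{hex_code(code)} {hex_text(text)}" for code, text in chunk)
        lines.append("endbfchar")
    lines += [
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end",
        "end",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical mapping: code 1 -> TELUGU LETTER A, code 2 -> KA + VIRAMA.
    print(build_tounicode_cmap({1: "\u0c05", 2: "\u0c15\u0c4d"}))
```

As dirkt notes, this only repairs one particular PDF; for future documents the font itself (or the choice of font) has to be fixed.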