0

When I copy and paste into an MS Word document from a PDF document which has some sets of fonts embedded, the result is illegible.

Several symbols are changed or even disappear.

Using Adobe Acrobat I can check which specific fonts are embedded.

  • Would installing such fonts in Microsoft Word work it out?
  • If so, where can I get or even create those subsets of the fonts I need?
  • If not, how could I solve this problem?
Kurt Pfeifle
GJC

3 Answers

4

You should first check your PDF document's fonts with the pdffonts utility. It is part of the XPDF package for Windows and can be used without installing, straight from a DOS box.

In order to successfully extract text (or copy'n'paste it) from a PDF, the font should either use a standard encoding (not a Custom one) or have a /ToUnicode table associated with it inside the PDF.

pdffonts returns a few basic pieces of information about the fonts used by your PDF.

Example output:

$ pdffonts -f 3 -l 5 sample.pdf
  name                      type          encoding     emb sub uni object ID
  ------------------------- ------------- ------------ --- --- --- ---------
  IADKRB+Arial-BoldMT       CID TrueType  Identity-H   yes yes yes     10  0
  SSKFGJ+ArialMT            CID TrueType  Custom       yes yes no      11  0

The command above asked for the fonts used in pages 3 (-f 3, the first page to check) through 5 (-l 5, the last page to check).

In the above case, both fonts are embedded as subsets (indicated by the XYZABC+ prefixes on their names, as well as by the yes in the emb and sub columns).
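To make the prefix rule concrete, here is a minimal Python sketch that splits a font name into its subset tag and base name. It assumes the usual PDF convention of a tag made of exactly six uppercase letters followed by a plus sign; the function name `split_subset_name` is made up for this illustration.

```python
import re

# Subset tag convention: exactly six uppercase ASCII letters, then "+".
SUBSET_TAG = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset_name(font_name):
    """Return (tag, base_name); tag is None when there is no subset prefix."""
    match = SUBSET_TAG.match(font_name)
    if match:
        return match.group(1), match.group(2)
    return None, font_name


if __name__ == "__main__":
    for name in ("IADKRB+Arial-BoldMT", "SSKFGJ+ArialMT", "ArialMT"):
        print(name, "->", split_subset_name(name))
```

A font name without the prefix is returned unchanged, with None as the tag.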

The font SSKFGJ+ArialMT uses a custom encoding, but the PDF has no /ToUnicode for this font, as indicated by the no entry for the column headed uni.

Hence it is not easy to extract text that is shown with this font (extraction would require manual reverse engineering -- but then you can also just "read" the PDF pages).
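If you have many PDFs to check, the same reading of the table can be scripted. The following is only a sketch in Python: it assumes the column layout shown in the example output above (including the encoding column, which older pdffonts versions may not print) and that font names and encoding names contain no spaces. The names `FontInfo`, `parse_pdffonts` and `hard_to_extract` are invented for this example.

```python
from typing import NamedTuple


class FontInfo(NamedTuple):
    name: str
    font_type: str
    encoding: str
    embedded: bool
    subset: bool
    has_tounicode: bool


# Sample taken from the example output above.
SAMPLE_OUTPUT = """\
name                      type          encoding     emb sub uni object ID
------------------------- ------------- ------------ --- --- --- ---------
IADKRB+Arial-BoldMT       CID TrueType  Identity-H   yes yes yes     10  0
SSKFGJ+ArialMT            CID TrueType  Custom       yes yes no      11  0
"""


def parse_pdffonts(output):
    """Parse pdffonts text output into a list of FontInfo records."""
    lines = output.splitlines()
    start = None
    # The data rows begin right after the line made only of dashes and spaces.
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped and set(stripped) <= {"-", " "}:
            start = i + 1
            break
    if start is None:
        return []
    fonts = []
    for line in lines[start:]:
        tokens = line.split()
        if len(tokens) < 8:
            continue  # blank or unexpected line
        # Read from the right, because the type column may contain spaces.
        encoding, emb, sub, uni = tokens[-6:-2]
        font_type = " ".join(tokens[1:-6])
        fonts.append(FontInfo(tokens[0], font_type, encoding,
                              emb == "yes", sub == "yes", uni == "yes"))
    return fonts


def hard_to_extract(fonts):
    """Names of fonts with a Custom encoding and no /ToUnicode table."""
    return [f.name for f in fonts
            if f.encoding == "Custom" and not f.has_tounicode]


if __name__ == "__main__":
    print(hard_to_extract(parse_pdffonts(SAMPLE_OUTPUT)))
```

For the sample, only SSKFGJ+ArialMT is flagged, matching the manual reading of the uni column.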

You should first check whether copy'n'pasting of text works with a simple text file as the target (not an MS Word document). If it doesn't, you can already forget about MS Word...
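A scripted variant of this plain-text check is possible with pdftotext, which ships in the same XPDF package as pdffonts. The Python sketch below is an illustration under assumptions: pdftotext must be on the PATH, and the helper names (`pdftotext_command`, `suspicious_ratio`, `extract_text`) as well as the "suspicious character" heuristic are made up here, not part of XPDF.

```python
import shutil
import subprocess
import unicodedata


def pdftotext_command(pdf_path, first_page=None, last_page=None):
    """Build a pdftotext call that writes UTF-8 text to stdout ("-")."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    return cmd + [pdf_path, "-"]


def suspicious_ratio(text):
    """Share of characters that look like extraction debris (heuristic)."""
    if not text:
        return 0.0

    def is_suspicious(ch):
        if ch in "\n\r\t\f":
            return False  # ordinary whitespace and page breaks
        # Replacement character, control, private-use or unassigned code points.
        return ch == "\ufffd" or unicodedata.category(ch) in ("Cc", "Co", "Cn")

    return sum(1 for ch in text if is_suspicious(ch)) / len(text)


def extract_text(pdf_path, first_page=None, last_page=None):
    """Run pdftotext and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(pdftotext_command(pdf_path, first_page, last_page),
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

A high ratio suggests the text layer is unusable. A low ratio proves nothing, though: a custom encoding can map glyphs to perfectly valid but wrong characters, so you still need to read a sample of the extracted text.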


  • Would installing such fonts in Microsoft Word work it out?
    Very likely not. (I cannot give a definite answer without having access to the PDF in question myself.)
  • If so, where can I get or even create those subsets of the fonts I need?
    You could extract the subsetted fonts from the PDF itself. (Funnily, my most popular StackOverflow answer deals with exactly that question -- I dunno why people seem to be so crazy about extracting fonts from PDF files other than for debugging purposes...)
  • If not, how could I solve this problem?
    There is no solution other than doing this manually.

Update

Unfortunately, you cannot get exactly the same info about the fonts used by a PDF via Acrobat or Adobe Reader. What you can get via Menu -> File -> Properties... is

  • the font names,
  • the subset info (but not the prefixes used for subset font names),
  • the encoding and
  • the font type.

But you do not get the info about the presence of a /ToUnicode table.

Kurt Pfeifle
  • Hi! @KurtPfeifle First of all, thank you for replying so thoroughly; now, I am a real newbie so I'd appreciate it so much if you could just offer some step-by-step guidelines. How can I download the XPDF package for Windows and make it work? Also, when you say "There is no solution other than doing this manually", what do you exactly mean? sorry for the non-specific terminology I may use. – GJC Apr 14 '15 at 18:53
  • @GEORGEJUNG: Sorry, can't think of a better and step-by-step guideline than this one. You may want to search for and go through **[my other SO answers](http://stackoverflow.com/search?q=user%3A359307+%5Bpdf%5D+text)** to similar questions -- you may then discover other aspects of the text extraction problem not discussed here. – Kurt Pfeifle Apr 14 '15 at 20:30
  • @GEORGEJUNG: With "doing manually" I meant the choice of two things: Either ***(1)*** copy your text from the PDF manually (type it; difficulty level: easy), or ***(2)*** edit the PDF manually and add a valid `/ToUnicode` object to it (requires lots of specific skills; difficulty level: very hard). – Kurt Pfeifle Apr 14 '15 at 20:34
  • I downloaded the rar file, but after executing pdffonts.exe, just a screen shot of the DOS box ensues... so how should I proceed now? – GJC Apr 15 '15 at 08:46
  • @GEORGEJUNG: It's not a .rar, it's a .zip file. You unpack it, copy the `pdffonts.exe` to some convenient directory. Then open a `cmd.exe` (= *"DOS box"*) window. Inside this window change directory to where your `pdffonts.exe` is located. Then execute `pdffonts.exe c:\path\to\your\pdf.pdf`. – Kurt Pfeifle Apr 15 '15 at 09:31
  • thanks for your patience. I'm working on windows 7 and have no idea at all about commands and the like; unfortunately, I can only find guidelines for Linux, so I hope you can offer me some help as long as it is not too much of an inconvenience. – GJC Apr 15 '15 at 12:14
  • @GEORGEJUNG: This is a site primarily for *developers*. If you need *user guidelines* about Windows 7 and how to start commands in the "DOS box", you have to find other resources. Sorry, man. – Kurt Pfeifle Apr 15 '15 at 12:41
  • Hi again! The output shows no data concerning the encoding column, [not even its heading](http://s22.postimg.org/mj317flmp/Captura.png), so any help would be frankly appreciated. – GJC Apr 16 '15 at 05:05
  • @GEORGEJUNG: You probably have an older version of `pdffonts` than I do, which does not display that column. – Kurt Pfeifle Apr 16 '15 at 09:29
  • I've checked and my version is xpdfbin-win-3.04.zip. Also, I've read some of your several past posts and noticed you mentioned Forgefont (and others) as a possible solution to similar issues. I do think this answer is quite comprehensive but when I try to upvote it a message pops up saying 15 reputation is required. – GJC Apr 16 '15 at 11:42
  • @GEORGEJUNG: I didn't mention Forgefont, but maybe Fontforge. Fontforge is a font editor/creator software. However, this will not help here. – Kurt Pfeifle Apr 16 '15 at 13:13
1

My workaround is to save the PDF as a lossless or near-lossless image (such as .tiff), then create a new PDF from the image and run OCR on it. That way I lose no clarity or sharpness in the page image and get accurate OCR content that can be copied and pasted. And yes, lots of folks do something similar with screenshots from protected PDFs to grab all the text without retyping it. Simple non-expert scripting tools (such as Tornado's "Do It Again" freeware) and PDF-generating software make it easy to process hundreds of pages quickly and accurately. The result is only as accurate as OCR on relatively high-resolution images can be, so avoid screenshots taken without zooming in, or anything else captured at a much lower spatial resolution than the original document.

0

Would installing such fonts in Microsoft Word work it out?

Not necessarily, because often the information regarding the font is not present inside the PDF. In other words, although a reader can render the text fine from the binary data, the ASCII equivalent (which would be recoverable if the font data were present) is not available.

If not, how could I solve this problem?

Since the problem lies in the ambiguous PDF standard (which allows removal of font information), the best practice would be OCR.

Solution:

When I ran into similar problems, these are the steps I performed:

  1. I converted the whole PDF file into another PDF in which each slide is an image. (It worked best for me to first convert each slide into a TIFF using Adobe Acrobat and then recombine all these TIFFs into one single PDF.) The purpose is to get a purely image-based (binary) PDF.
  2. Then I ran it through the built-in OCR of Adobe Acrobat (the 'Enhance' feature). This makes Acrobat generate a fresh set of metadata, including all relevant font information. Save this PDF.
  3. Now I have a searchable PDF.
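
For batch work, the same rasterize-then-OCR idea can be scripted. The Python sketch below swaps Acrobat for the open-source ocrmypdf command-line tool (built on Tesseract), so it is a different tool chain from the steps above and not a reproduction of them. It assumes ocrmypdf is installed and on the PATH; the helper names `ocr_command` and `make_searchable` and the file names are made up for illustration.

```python
import shutil
import subprocess


def ocr_command(input_pdf, output_pdf, language="eng"):
    """Build an ocrmypdf call that rasterizes every page and re-runs OCR.

    --force-ocr discards the existing (unusable) text layer by rasterizing
    the pages first, which mirrors the "convert to images" step.
    """
    return ["ocrmypdf", "--force-ocr", "-l", language, input_pdf, output_pdf]


def make_searchable(input_pdf, output_pdf, language="eng"):
    """Run ocrmypdf; raises if the tool is missing or the run fails."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf not found on PATH")
    subprocess.run(ocr_command(input_pdf, output_pdf, language), check=True)


if __name__ == "__main__":
    # Hypothetical file names, shown only to illustrate the call.
    print(" ".join(ocr_command("broken-fonts.pdf", "searchable.pdf")))
```

As with the Acrobat route, the text you can copy afterwards is OCR output, so it is worth proofreading symbols and unusual characters.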
Rahul