15

I am writing a Master's thesis on an NLP system. One of its components is a text extractor.

It extracts plain text from PDF files. A few PDF files cannot be extracted correctly; for those, the extractor (the PDFBox library) returns a string like this:

"┤xDn║if|d├gDF"Ti&cD╬lh d FÁhis~n ╗xd f«"d┤ffih »h"

or

"10a61a91a22a25a3a27a17a23a20a8a13a14a61a25a17"

I checked every file that causes this extraction problem, and in all of them the text also cannot be copy-pasted from a PDF reader (Adobe Reader and Foxit Reader). The files display correctly in these readers, but after selecting the content and copying it to the clipboard I get the same wrong text as described above: strings of meaningless characters, or strings of digits and letters.

Could anybody help me?

Kurt Pfeifle
Michal_R
  • Sometimes, you simply cannot get the text out without resorting to OCR (optical character recognition). This sounds like one of them. – Mark Storer Jun 22 '11 at 17:48

8 Answers

10

Very often in cases like this, where you can't select and copy'n'paste text from the Acrobat (Reader) window, there is another option which may work nevertheless:

  • Open 'File' menu,
  • select 'Save as...',
  • select 'Text (normal) (*.txt)',
  • browse to the target directory,
  • type the name you want to use for the text file.

You'll get all the text from all pages in the file and will need to locate the passage you initially wanted to copy'n'paste, so it is not as comfortable as direct copy'n'paste. But it works more reliably.

It also works with acroread on Linux (but you have to choose 'Save as text...' from the file menu).

Update

You can use the pdffonts command line utility to get a quick-shot analysis of the fonts used by a PDF.

Here is an example output which shows where a problem for text extraction will very likely occur. It uses one of the hand-coded PDF files from a GitHub repository that was created to provide well-commented PDF sample files which can easily be opened in a text editor:

$ pdffonts  textextract-bad2.pdf
  name                            type         encoding    emb sub uni object ID
  ------------------------------- ------------ ----------- --- --- --- ---------
  BAAAAA+Helvetica                TrueType     WinAnsi     yes yes yes     12  0
  CAAAAA+Helvetica-Bold           TrueType     WinAnsi     yes yes no      13  0

How to interpret this table?

  • The above PDF file uses two subsetted fonts (as indicated by the BAAAAA+ and CAAAAA+ prefixes to their names, as well as by the yes entries in the sub column), Helvetica and Helvetica-Bold.
  • Both fonts are of type TrueType.
  • Both fonts use a WinAnsi encoding (a font encoding maps char identifiers used in the PDF source code to glyphs that should be drawn). However, only for font /Helvetica is there a /ToUnicode table available inside the PDF (for /Helvetica-Bold there is none), as indicated by the yes/no in the uni column.

The /ToUnicode table is required to provide a reverse mapping from character identifiers/codes to characters.

A missing /ToUnicode table for a specific font is almost always a sure indicator that text strings using this font cannot be extracted or copied'n'pasted from the PDF. (Even if a /ToUnicode table is there, text extraction may still pose a problem, because this table may be damaged, incorrect or incomplete -- as seen in many real-world PDF files, and as also demonstrated by a few companion files in the above linked GitHub repository.)
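The check described above can be automated when many files have to be screened. What follows is only a minimal sketch in Python: it assumes the pdffonts output has the column layout shown in the example (name, type, encoding, emb, sub, uni, object ID) and that font names contain no spaces. The function name fonts_without_tounicode and the SAMPLE constant are made up for this illustration.

```python
import re

# Sample text copied from the pdffonts example above.
SAMPLE = """\
name                            type         encoding    emb sub uni object ID
------------------------------- ------------ ----------- --- --- --- ---------
BAAAAA+Helvetica                TrueType     WinAnsi     yes yes yes     12  0
CAAAAA+Helvetica-Bold           TrueType     WinAnsi     yes yes no      13  0
"""

# A data row is matched from its right-hand side, because the "type"
# column may itself contain spaces (for example "Type 1").
# Assumption: the font name is the first whitespace-delimited token.
ROW = re.compile(
    r"^\s*(?P<name>\S+)\s+(?P<rest>.*?)\s+"
    r"(?P<emb>yes|no)\s+(?P<sub>yes|no)\s+(?P<uni>yes|no)\s+"
    r"(?P<obj>\d+)\s+(?P<gen>\d+)\s*$"
)


def fonts_without_tounicode(pdffonts_output):
    """Return the names of fonts whose 'uni' column says 'no'."""
    missing = []
    for line in pdffonts_output.splitlines():
        match = ROW.match(line)  # header and ruler lines do not match
        if match and match.group("uni") == "no":
            missing.append(match.group("name"))
    return missing


print(fonts_without_tounicode(SAMPLE))
```

For a real file, the text passed to the function could be captured by running pdffonts through subprocess.run with capture_output=True and text=True. An empty result does not prove that extraction will work, since an existing /ToUnicode table can still be damaged or incomplete.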

Kurt Pfeifle
  • Thanks. I was wondering how to create a pdf file with perfect text view but gibberish copy-paste text? https://unix.stackexchange.com/questions/554416/how-to-create-a-pdf-file-with-gibberish-text-for-copy-and-paste – Tim Nov 27 '19 at 13:07
6

If you are able to successfully select and copy the text in Adobe Reader (which indicates that the PDF does contain text objects) but you can't paste the copied text into Notepad without it looking like a bunch of garbage characters, then the problem is probably related to the CMap that the selected text uses.

The PDF specification provides many options for the display of textual content and the related extraction of the text content. A CMap specifies the mapping from character codes to character selectors. The PDF spec outlines some predefined CMaps, but other CMaps can also be embedded.

My guess is that either the CMap for this text is corrupt or that the PDFBox library doesn't support this particular CMap. I suggest trying a different SDK just to see if you get any different results.
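To illustrate why a missing or unusable mapping produces garbage, here is a toy model in Python. It is not how PDFBox or any real PDF library works, and it does not parse PDF at all; the functions encode_with_subset_font and extract and the starting code 0xC0 are invented for this sketch. The idea is only that a font may number its glyphs arbitrarily, so the stored codes are meaningless without a table that maps them back to characters.

```python
def encode_with_subset_font(text):
    """Give each distinct character an arbitrary code, in order of first use.

    Toy assumption: fewer than 64 distinct characters, so every code
    still fits into a single byte.
    """
    code_for = {}
    codes = []
    for ch in text:
        if ch not in code_for:
            code_for[ch] = 0xC0 + len(code_for)  # arbitrary starting code
        codes.append(code_for[ch])
    # The reverse table plays the role of a code-to-character mapping.
    to_unicode = {code: ch for ch, code in code_for.items()}
    return bytes(codes), to_unicode


def extract(codes, to_unicode=None):
    """Recover text from codes; without a mapping, fall back to Latin-1."""
    if to_unicode is None:
        # No mapping available: the codes are read as if they were
        # ordinary Latin-1 bytes, which yields unrelated characters.
        return codes.decode("latin-1")
    return "".join(to_unicode[code] for code in codes)


codes, to_unicode = encode_with_subset_font("hello pdf")
print(extract(codes, to_unicode))  # mapping present: original text
print(extract(codes))              # mapping absent: garbage characters
```

A viewer can still draw the page correctly in the second case, because drawing only needs the glyph for each code, not the character it stands for.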

Rowan
2

When the PDF is opened as a Gmail attachment in Chrome (using its built-in PDF viewer), copying does yield normal, readable characters!

It worked for me when I had this problem and for others as well. I think the Chrome PDF viewer uses the Google Drive OCR automatically... It's like magic!

Community
Michel de Ruiter
  • How did you test this would work for OP's PDF? I don't see a link. – Jongware Feb 28 '16 at 15:15
  • @RadLexus He didn't provide a document did he? I had the same problem and it helped me. No reason to downvote IMHO. – Michel de Ruiter Feb 28 '16 at 19:28
  • @Michel as *he didn't provide a document*, how could you claim that *Chrome does copy normal readable characters*? (I didn't downvote, but in combination with your comment I am seriously considering it.) – mkl Feb 28 '16 at 19:47
  • The problem, as pointed out by others, is that the file most likely does not contain the encoding of its fonts. Without that, it's not possible to copy plain text out of it. (To be absolutely sure, I'd need to see OP's file. But I'm equally positive your method simply won't work - not even Chrome will succeed where Adobe's own Acrobat fails.) – Jongware Feb 28 '16 at 21:48
  • 1
    I wish I had a public example PDF to prove this works (at least for some documents). – Michel de Ruiter Feb 28 '16 at 23:25
  • Okay, that's fair. Browse Stack Overflow for similar questions - it has been asked tons and tons of times before. There is bound to be one with an example file. – Jongware Feb 28 '16 at 23:37
  • 1
    I can confirm that it works. I can't paste the text here as the documents are confidential, but we got gibberish when trying to copy and paste from Adobe Reader and normal text when using Chrome's native PDF viewer. – Ethernal Jun 01 '16 at 06:26
1

What was the PDF created with? Some PDFs do not contain any encoding information, just the data needed to draw the page. In that case there is no way to extract the text.

mark stephens
0

  • Select the text you wish to copy.
  • Right-click and choose the option "Export Selection as".
  • In the dialog box, choose a file name and save the new file as Rich Text Format (RTF).
  • Open the RTF file to see your text!

Eapen
0

The best way to deal with this is to convert the PDF file to Word using this website: https://www.ilovepdf.com/pdf_to_word

The garbage issue will be fixed.

-1

The best way to deal with this (assuming you have Adobe Acrobat or something similar; I am not sure whether Reader can do this) is to save the document as JPEG images. Then recombine all the images into a single PDF and use the OCR function to recognize the text in the pages. After that you can copy and paste the text.

user6096423
-4

PDF is not a text document. It's more of a vector graphic format that sometimes can contain text. So there are some documents from which you can't extract text unless you are willing to do OCR. That's just the way it is.

Ghostrider
  • I am thinking about a workaround for these files using OCR. – Michal_R May 28 '10 at 10:20
  • 5
    That's a very misleading answer actually. Text and vector art are both first-class citizens in a PDF world. The problem is not that this is a vector format, the problem is that some PDF writers don't put in all necessary information to be able to correctly copy and paste. – David van Driessche Apr 05 '15 at 13:36