I have run into a problem while reading a PDF with PDFBox. Part of my PDF is unreadable: when I copy the unreadable part and paste it into an editor, it shows little box symbols. When I read the same file via PDFBox, those characters are not read at all (and I don't expect them to be read correctly). What I do expect is to get at least some symbols or random characters in place of the actual characters. Is there any way to do that? The line can be selected, so it isn't an image. Has anyone found a workaround for this?

There is a PDFBox example that overrides the writeString method of the PDFTextStripper class to get some extra font properties. I am using that method to get my text and some font properties. So my question is: why doesn't PDFBox read every character (even if it prints gibberish)? In my case, I counted the number of times the method was called (each call corresponds to one character) and saw that the number of calls matched the number of characters in the output text but not the total number of characters in the PDF. Here's a sample PDF: the word "Profit" is unreadable, and PDFBox doesn't even output gibberish for this word, it skips it altogether. Here's the link: https://drive.google.com/file/d/0B_Ke2amBgdpedUNwVTR3RVlRTFE/view?usp=sharing

ANKIT
  • Please **A** share the pivotal code (or, if you are using a PDFBox example as is, name it) and **B** share a sample PDF to allow reproduction of the issue. – mkl Jun 16 '16 at 15:31
  • There is a PDFBox example where we override the writeString method to get some extra font properties. – ANKIT Jun 17 '16 at 04:18
  • @ANKIT Hello, can you please share what you did to resolve this issue? I am facing a similar problem with a similar PDF. I have been struggling with the PDF reading part for many weeks but am not able to read the PDF accurately. I would be very thankful for any help in this matter. Please let me know your solution or findings; if you have material that could help resolve this issue, please email me at viru.nalawade@gmail.com. – Viraj Nalawade Oct 31 '16 at 06:15

1 Answer

The first file "PnL_500010_0314.pdf"

Indeed, the whole line "Statement of Profit and Loss for the year ended March 31, 2014" and much more cannot be extracted. Inspecting the content makes the reason obvious: this text is written using a composite font which has neither an Encoding nor a ToUnicode entry that would allow identifying the characters in question.

The method showGlyph of org.apache.pdfbox.text.PDFTextStreamEngine (from which PDFTextStripper is derived) contains the following code shortly before it calls processTextPosition (which PDFTextStripper implements and from which it retrieves its text information):

    // use our additional glyph list for Unicode mapping
    unicode = font.toUnicode(code, glyphList);

    // when there is no Unicode mapping available, Acrobat simply coerces the character code
    // into Unicode, so we do the same. Subclasses of PDFStreamEngine don't necessarily want
    // this, which is why we leave it until this point in PDFTextStreamEngine.
    if (unicode == null)
    {
        if (font instanceof PDSimpleFont)
        {
            char c = (char) code;
            unicode = new String(new char[] { c });
        }
        else
        {
            // Acrobat doesn't seem to coerce composite font's character codes, instead it
            // skips them. See the "allah2.pdf" TestTextStripper file.
            return;
        }
    }

The font in question does not offer any clues for text extraction. Thus, unicode here is null.

Furthermore, the font is composite, not simple. Thus, the else clause is executed and processTextPosition is not even called.

PDFTextStripper, therefore, is not informed at all that the line "Statement of Profit and Loss for the year ended March 31, 2014" even exists!
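
To make the control flow explicit, here is a minimal, self-contained model of that decision. It is only a sketch and not PDFBox code: the class `GlyphFallbackModel`, the enum `FontKind`, and the method `resolve` are hypothetical names introduced for illustration, and a `null` result stands in for the early `return` that prevents `processTextPosition` from being called.

```java
/** Simplified model of the fallback quoted above; hypothetical names, not PDFBox API. */
final class GlyphFallbackModel {

    /** Hypothetical stand-in for the distinction between simple and composite fonts. */
    enum FontKind { SIMPLE, COMPOSITE }

    /**
     * Returns the text reported for one glyph, or null if the glyph is skipped.
     *
     * @param mapped Unicode derived from the font's own mapping data, or null if there is none
     * @param kind   whether the font is simple or composite
     * @param code   the character code read from the content stream
     */
    static String resolve(String mapped, FontKind kind, int code) {
        if (mapped != null) {
            return mapped;                       // the font supplied a mapping
        }
        if (kind == FontKind.SIMPLE) {
            return String.valueOf((char) code);  // coerce the code into a char
        }
        return null;                             // composite font: the glyph is dropped
    }

    public static void main(String[] args) {
        // A simple font without mapping data is coerced, a composite one is skipped.
        System.out.println(resolve(null, FontKind.SIMPLE, 0x50));
        System.out.println(resolve(null, FontKind.COMPOSITE, 0x50));
    }
}
```

In this model the first file corresponds to the last branch: no mapping and a composite font, so nothing is reported for the glyph.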

If you replace that

    else
    {
        // Acrobat doesn't seem to coerce composite font's character codes, instead it
        // skips them. See the "allah2.pdf" TestTextStripper file.
        return;
    }

in PDFTextStreamEngine.showGlyph by some code setting unicode, e.g. using the Unicode replacement character

    else
    {
        // Use the Unicode replacement character to indicate an unknown character
        unicode = "\uFFFD";
    }

you'll get

57
THIRTY SEVENTH ANNUAL REPORT 2013-14
STANDALONE FINANCIAL STATEMENTS
�������������������������������������������������������������
As per our report attached. Directors
For Deloitte Haskins & Sells LLP Deepak S. Parekh Nasser Munjee R. S. Tarneja
Chartered Accountants �������� B. S. Mehta J. J. Irani
D. N. Ghosh Bimal Jalan
Keki M. Mistry S. A. Dave D. M. Sukthankar
Sanjiv V. Pilgaonkar ���������������
Partner �����������������������
Renu Sud Karnad V. Srinivasa Rangan Girish V. Koliyote
������, May 6, 2014 Managing Director ������������������ �����������������
Notes Previous Year
� in Crore � in Crore
INCOME
����������������������� 23  23,894.03  20,796.95 
���������������������������� 24  248.98  315.55 
������������ 25  54.66  35.12 
Total Revenue  24,197.67  21,147.62 
EXPENSES
Finance Cost 26  16,029.37  13,890.89 
�������������� 27  279.18  246.19 
���������������������� 28  86.98  75.68 
�������������� 29  230.03  193.43 
������������������������������ 11 & 12  31.87  23.59 
Provision for Contingencies  100.00  145.00 
Total Expenses  16,757.43  14,574.78 

PROFIT BEFORE TAX  7,440.24  6,572.84 
�����������
�������������  1,973.00  1,727.68 
�������������� 14  27.00  (3.18)
PROFIT FOR THE YEAR 3  5,440.24  4,848.34 
EARNINGS PER SHARE��������������� 2) 31
- Basic 34.89 31.84
- Diluted 34.62 31.45
�������������������������������������������������������������

Unfortunately, the PDFTextStreamEngine.showGlyph method uses some private class members. Thus, one cannot simply override it in one's own PDFTextStripper subclass by copying the original method code with the change indicated above. One has to replicate nearly all functionality of PDFTextStreamEngine in one's own class, resort to Java reflection, or patch the PDFBox classes themselves.
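
For the reflection route, the generic mechanism looks roughly like the following sketch. It deliberately uses stand-in classes instead of PDFBox ones: `BaseEngine`, `MyStripper`, the field `rotation`, and the helper `readPrivateInt` are all hypothetical, so the actual member names have to be taken from the PDFBox source of the version in use.

```java
/** Generic sketch of reading a private superclass field via reflection; all names are hypothetical. */
final class PrivateFieldSketch {

    /** Hypothetical stand-in for a library base class that keeps state in a private field. */
    static class BaseEngine {
        private int rotation = 90;
    }

    /** Hypothetical stand-in for one's own subclass of that library class. */
    static class MyStripper extends BaseEngine {
    }

    /**
     * Reads a private int field of target that is declared in declaringClass.
     * The declaring class is passed explicitly because getDeclaredField
     * does not search superclasses.
     */
    static int readPrivateInt(Class<?> declaringClass, Object target, String fieldName) {
        try {
            java.lang.reflect.Field field = declaringClass.getDeclaredField(fieldName);
            field.setAccessible(true); // bypass the private modifier
            return field.getInt(target);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot read field " + fieldName, e);
        }
    }

    public static void main(String[] args) {
        // The subclass instance is inspected through the class that declares the field.
        System.out.println(readPrivateInt(BaseEngine.class, new MyStripper(), "rotation"));
    }
}
```

Such code breaks silently whenever the library renames or removes the member, which is one reason to prefer patching a local build if that is an option.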

This architecture is not exactly perfect.

The second file "Bal_532935_0314.pdf"

The issue in the second file is caused by the same piece of PDFBox code quoted above. This time, though, the font is simple, so the other code block is executed:

    if (font instanceof PDSimpleFont)
    {
        char c = (char) code;
        unicode = new String(new char[] { c });
    }

What happens here is pure guesswork: If there is no information for mapping glyph code to Unicode, let's assume the mapping is Latin-1 which embeds trivially into char. As becomes visible in the OP's second file, this assumption does not always hold.
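
That the cast amounts to a Latin-1 assumption can be checked in isolation. The following sketch compares the cast with an explicit ISO-8859-1 decoding for all 256 single-byte code values; `Latin1Coercion` and its methods are hypothetical names for illustration, not PDFBox API.

```java
/** Sketch showing that the char cast equals ISO-8859-1 decoding for single-byte codes. */
final class Latin1Coercion {

    /** Same coercion as in the quoted if block: the code becomes the char value. */
    static String coerce(int code) {
        char c = (char) code;
        return new String(new char[] { c });
    }

    /** Decodes a single-byte code explicitly as ISO-8859-1 (Latin-1). */
    static String decodeLatin1(int code) {
        return new String(new byte[] { (byte) code },
                java.nio.charset.StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        boolean same = true;
        for (int code = 0; code < 256; code++) {
            same &= coerce(code).equals(decodeLatin1(code));
        }
        System.out.println("cast equals Latin-1 decoding: " + same);
    }
}
```

If a font's codes are not Latin-1 at all, for example if an embedded subset font happened to number its glyphs arbitrarily, the same cast yields control characters or unrelated letters instead of the displayed text.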

If you don't want PDFBox to make assumptions like this, also replace the if block above with

    if (font instanceof PDSimpleFont)
    {
        // Use the Unicode replacement character to indicate an unknown character
        unicode = "\uFFFD";
    }

This results in

Aries Agro Care Private Limited
1118th Annual Report 2013-14
Balance Sheet as at 31st March, 2014
Particulars Note
No.
 As at 
31 March, 2014
Rupees
 As at
31 March, 2013
Rupees
I. EQUITY AND LIABILITIES
(1) Shareholder's Funds
(a) ������������� 3  100,000  100,000
(b) Reserves and Surplus 4  (2,673,971) ������������
 (2,573,971) ������������
(2) Current Liabilities
(a) Short Term Borrowings 5  5,805,535 �����������
(b) Trade Payables 6  159,400 ���������
(c) ������������������������� 7  2,500  22,743 
 5,967,435  5,934,756 
TOTAL  3,393,464 �����������
II. ASSETS
(1) Non-Current Assets
(a) �������������������� �  - -
 - -
(2) Current Assets
(a) ����������������������� 9  39,605 �������
(b) ����������������������������� 10  3,353,859 ����������
 3,393,464 ����������
TOTAL  3,393,464 ����������
��������������������������������
The Notes to Accounts 1 to 23 form part of these Financial Statements
As per our report of even date For and on behalf of the Board
For Kirti D. Shah & Associates 
��������������������� 
�����������������������������
Dr. Jimmy Mirchandani
Director
Kirti D. Shah 
Proprietor 
Membership No 32371
Dr. Rahul Mirchandani 
Director
Place : Mumbai. 
Date :- 26th May, 2014.
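
Once unmapped glyphs show up as replacement characters, as in the two outputs above, lines with unextractable text can be flagged after extraction. A minimal sketch, assuming the extracted text is processed line by line; `UnmappedGlyphReport` and its methods are hypothetical helpers, not part of PDFBox.

```java
/** Sketch for flagging extracted lines that contain replacement characters. */
final class UnmappedGlyphReport {

    /** U+FFFD, the character substituted above for glyphs without a Unicode mapping. */
    static final char REPLACEMENT = '\uFFFD';

    /** Counts the replacement characters in one extracted line. */
    static int countUnmapped(String line) {
        int count = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == REPLACEMENT) {
                count++;
            }
        }
        return count;
    }

    /** Share of non-whitespace characters that are U+FFFD (0.0 for blank lines). */
    static double unmappedRatio(String line) {
        int visible = 0;
        int unmapped = 0;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (Character.isWhitespace(c)) {
                continue;
            }
            visible++;
            if (c == REPLACEMENT) {
                unmapped++;
            }
        }
        return visible == 0 ? 0.0 : (double) unmapped / visible;
    }

    public static void main(String[] args) {
        String line = "Chartered Accountants \uFFFD\uFFFD\uFFFD B. S. Mehta";
        System.out.println(countUnmapped(line) + " unmapped glyph(s) in: " + line);
    }
}
```

This only detects glyphs that were replaced explicitly; characters that a simple font's coercion guessed wrongly look like ordinary text and are not caught by such a count.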
mkl
  • Thanks mkl. This was the answer I was looking for. I think I may have to copy the entire PDFStreamEngine class because I need to know whether data exists in the PDF or not. How did you get the unreadable part? Did you replicate the entire class (since the method is private)? – ANKIT Jun 17 '16 at 10:27
  • There are some PDFs where the data is still unreadable but PDFBox reads it and gives gibberish letters in its place. Is there any way to know when we get gibberish characters? E.g. the words "share capital" are unreadable and I am getting some random letters in their place. So is there any way to know that those are just random letters (without using a dictionary, as there are unwanted spaces which rule out that method)? Here's the link to the PDF: https://drive.google.com/file/d/0B_Ke2amBgdpebm96U05FcWFsSXM/view?usp=sharing – ANKIT Jun 17 '16 at 10:40
  • *How did you get the unreadable part? Did you replicate the entire class (since the method is private)?* - I test against a copy of the PDFBox code anyway, so I simply edited it. – mkl Jun 17 '16 at 11:09
  • How can I do the same? I tried loading the src folder in Eclipse but it didn't work. – ANKIT Jun 17 '16 at 11:55
  • PDFBox uses [Apache Maven](https://maven.apache.org/) for build management. Simply import PDFBox as an existing Maven project into Eclipse; you can then easily edit and build it. Hopefully you use Maven for your test project, too, because then it suffices to add PDFBox as a dependency of your project to make Eclipse use it, at least if you have "Resolve dependencies from Workspace projects" checked in the project properties. – mkl Jun 17 '16 at 12:20
  • Thanks mkl. Will try that – ANKIT Jun 17 '16 at 12:28
  • Maybe I am being greedy, but I don't want to replace the first if statement, as it makes correct guesses for some files. So can I distinguish between the correct and wrong guesses? – ANKIT Jun 17 '16 at 12:49
  • I copied the entire PDFBox source into my src folder, and now, when I create an object of the PDFTextStripper class, a "java.lang.ExceptionInInitializerError" occurs. How can I solve this? – ANKIT Jun 17 '16 at 13:43
  • *So can I distinguish between the correct and wrong guesses?* - You might try dictionary lookups or OCR for comparison. – mkl Jun 17 '16 at 16:34
  • *a "java.lang.ExceptionInInitializerError" occurs. How can I solve this?* - I have only built PDFBox using Maven. You seem to be trying something different. – mkl Jun 17 '16 at 16:36
  • Dictionary lookup won't work since there are unwanted spaces in the PDFs. Yes, I tried without Maven. Anyway, that issue is resolved now. Thanks for your time. – ANKIT Jun 17 '16 at 17:17
  • If my answer essentially helped you to get on track, please accept it (click the tick at its upper left). – mkl Jun 20 '16 at 04:14
  • Sorry, I am new here and didn't know about accepting answers. Yes, it did help me and I have accepted it now. – ANKIT Jun 21 '16 at 08:16
  • @mkl: Which version of PDFBox are you using? I can't find PDFTextStreamEngine in 1.8; it has PDFStreamEngine instead. – sampopes Sep 03 '17 at 16:34
  • @sampopes the `PDFTextStreamEngine` has been introduced in pdfbox 2.0.0 and still is in the current 3.0.0-SNAPSHOT. – mkl Sep 04 '17 at 04:10
  • @mkl I got that. Do you think I can edit the code in 2.0 to read from my own glyphs file? There is code that reads an additional glyphs file; could I edit that file to make it work? – sampopes Sep 06 '17 at 15:43