I have a PDF file from which I have trouble extracting text using the iTextSharp API.

Some of the numbers are replaced by other numbers or by slashes: "//"

The PDF file originally came from MS Word and was exported using "Save as PDF". I have to work with the PDF file, not the Word document.

You can see the problem very clearly when you try to copy and paste some numbers from the file. For example, if you try to copy and paste the 6-digit number at the bottom, you can see that it changes from 201333 to 333222.

You can also see the problem with the date string: 11/4/2016 turns into // // 11110

When I print the PDF file using the Adobe PDF converter printer on my computer, it gets fixed, but I need to fix it automatically, using C# for example.

Thanks

The file is shared here : https://www.dropbox.com/s/j6w9350oyit0od8/OnePageGili.pdf?dl=0

  • Please inspect your PDF and check whether it is an issue like the one [answered here](http://stackoverflow.com/a/22688775/1729265). If you cannot check it yourself, please share your PDF for analysis. – mkl Apr 18 '16 at 08:25

1 Answer

In a nutshell

iTextSharp text extraction results exactly reflect what the PDF claims the characters in question mean. Thus, text extraction as recommended by the PDF specification (which relies on this information) will always return this.

The embedded fonts contain different information. Thus, text extraction methods that distrust the PDF's claims may return more satisfying results.

In more detail

First of all, you say

I have a PDF file from which I have trouble extracting text using the iTextSharp API.

and so make it sound like an iTextSharp-specific issue. Later, though, you state

You can see the problem very clearly when you try to copy and paste some numbers from the file.

If you can also see the issue with copy&paste, it is not an iTextSharp-specific issue. Either it is an issue shared by multiple PDF processors, including the viewer you copied and pasted with, or it is simply an issue of the PDF you have.

As it turns out, it is the latter: you have a PDF that lies about its contents.

For example, let's look at the text you pointed out:

For example, if you try to copy and paste the 6-digit number at the bottom, you can see that it changes from 201333 to 333222.

Inspecting the PDF page content stream, you'll find those six digits generated by these instructions:

/F3 11.04 Tf
...
[<00150013>-4<0014>8<00160016>-4<0016>] TJ

I.e. the font F3 is selected (which uses Identity-H encoding, so each glyph is represented by two bytes) and the glyphs drawn are from left to right:

0015
0013
0014
0016
0016
0016
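The splitting of the TJ operand into glyph codes can be sketched as follows. This is only an illustration in Python, not iTextSharp code: the function name `glyph_codes_from_tj` is made up for this example, and it assumes that every string operand is a hex string and that the font uses a fixed two-byte encoding such as Identity-H.

```python
import re


def glyph_codes_from_tj(tj_array, bytes_per_code=2):
    """Split the hex strings of a TJ array into fixed-width glyph codes.

    Assumption: all string operands are hex strings (<...>) and the font
    uses a fixed-width encoding (two bytes per glyph for Identity-H).
    The numbers between the strings are positioning adjustments and are
    ignored here.
    """
    width = bytes_per_code * 2  # number of hex digits per glyph code
    codes = []
    for hex_string in re.findall(r"<([0-9A-Fa-f]+)>", tj_array):
        for i in range(0, len(hex_string), width):
            codes.append(hex_string[i:i + width].upper())
    return codes


tj = "[<00150013>-4<0014>8<00160016>-4<0016>] TJ"
print(glyph_codes_from_tj(tj))
# ['0015', '0013', '0014', '0016', '0016', '0016']
```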

The ToUnicode mapping of the font F3 in your PDF now claims:

1 beginbfrange
<0013> <0016> [<0033> <0033> <0033> <0032>]
endbfrange 

I.e. it says

  • glyph 0013 represents Unicode codepoint 0033, the digit 3
  • glyph 0014 represents Unicode codepoint 0033, the digit 3
  • glyph 0015 represents Unicode codepoint 0033, the digit 3
  • glyph 0016 represents Unicode codepoint 0032, the digit 2

So the string of glyphs drawn using the instructions above represents 333222 according to the ToUnicode map.
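To make the lookup concrete, here is a minimal Python sketch that applies the bfrange entry quoted above to the six glyph codes. It is a simplification, not a full CMap parser: the helper name `parse_bfrange_arrays` is hypothetical, and only the array form `<first> <last> [<dst> ...]` of a bfrange line is handled.

```python
import re


def parse_bfrange_arrays(cmap_text):
    """Build {glyph code: text} from bfrange lines in array form.

    Assumption: only lines shaped like <first> <last> [<dst1> <dst2> ...]
    are handled; bfchar sections and the <first> <last> <dst> form are
    deliberately left out of this sketch.
    """
    mapping = {}
    pattern = re.compile(
        r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*\[([^\]]*)\]"
    )
    for first, last, targets in pattern.findall(cmap_text):
        destinations = re.findall(r"<([0-9A-Fa-f]+)>", targets)
        for offset, dst in enumerate(destinations):
            code = int(first, 16) + offset
            if code > int(last, 16):
                break  # ignore surplus destinations beyond the range
            key = f"{code:0{len(first)}X}"
            # Destination strings in a ToUnicode CMap are UTF-16BE.
            mapping[key] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


TO_UNICODE = """
1 beginbfrange
<0013> <0016> [<0033> <0033> <0033> <0032>]
endbfrange
"""

drawn_glyphs = ["0015", "0013", "0014", "0016", "0016", "0016"]
mapping = parse_bfrange_arrays(TO_UNICODE)
extracted = "".join(mapping[g] for g in drawn_glyphs)
print(extracted)  # 333222
```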

The PDF specification presents the ToUnicode mapping as the highest priority method to map a character code to a Unicode value. Thus, a text extractor working according to the specification will return 333222 here.
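For comparison, one can pair the drawn glyph codes with the digits visible on the page to see what a truthful mapping would have to say. The Python sketch below does just that. It assumes that the rendered number 201333 from the question is what the glyphs really look like. The function name `infer_mapping` is invented for this illustration, and the result is an inference, not something read from the font program.

```python
def infer_mapping(glyph_codes, visible_text):
    """Pair each drawn glyph code with the character seen on the page.

    Assumption: one glyph per visible character, in the same order.
    Raises ValueError if a glyph code would need two different characters.
    """
    if len(glyph_codes) != len(visible_text):
        raise ValueError("glyph count and character count differ")
    inferred = {}
    for code, char in zip(glyph_codes, visible_text):
        if inferred.setdefault(code, char) != char:
            raise ValueError(f"glyph {code} maps to conflicting characters")
    return inferred


drawn = ["0015", "0013", "0014", "0016", "0016", "0016"]
inferred = infer_mapping(drawn, "201333")
for code in sorted(inferred):
    print(code, "->", f"{ord(inferred[code]):04X}", repr(inferred[code]))
# 0013 -> 0030 '0'
# 0014 -> 0031 '1'
# 0015 -> 0032 '2'
# 0016 -> 0033 '3'
```

Under that assumption, the four glyphs would map to the consecutive codepoints 0030 through 0033, not to 0033, 0033, 0033 and 0032 as the ToUnicode map claims.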

mkl
  • Thank you very much for the detailed information. Is there any iTextSharp solution? – Gili Givoni Apr 18 '16 at 13:00
  • The solution to look for would be to fix the PDF. For a fairly sure way to fix it, one can use iTextSharp as the PDF manipulation framework, but one would need additional resources, in particular a font library and ideally the fonts from which the subset fonts in your document have been created. It may prove non-trivial, or at least quite some work. To do this one should have some knowledge of font programs and PDF internals. – mkl Apr 18 '16 at 15:00