[Screen capture: PDF page with the affected characters encircled]

Hi All,

This is a question about iTextSharp version 5.5.13.1. I am using a custom LocationTextExtractionStrategy implementation to extract sensible words from a PDF document. I call the GetSingleSpaceWidth method of TextRenderInfo to decide when to join two adjacent blocks of characters into a single word, as described in the Stack Overflow question "itext java pdf to text creation".

This approach has generally worked well. However, in the attached document the words "Credit" and "Extended" are giving me problems. Why do all the characters encircled in the screen capture return zero for GetSingleSpaceWidth? As a result, instead of two separate words, my logic returns the single word "CreditExtended".

I understand that iTextSharp 5 is not supported any more. Any suggestions would be highly appreciated.

Sample document

https://drive.google.com/open?id=1pPyNRXvnUyIA2CeRrv05-H9q0sTUN97d

Sau001
  • I'll have a look at your file later this week. One possible cause would be that the font in question does not have a space glyph at all and that the distance between "Credit" and "Extended" is achieved by explicitly moving the text insertion point. – mkl Jul 03 '19 at 20:13

1 Answer

As already conjectured in a comment, the cause is that the font in question does not contain a regular space glyph or, more precisely, does not map any of its glyphs to the Unicode value U+0020 in its ToUnicode map.

If a font has a ToUnicode map, iText uses only the information from that map. Thus, iText does not identify a space glyph in that font, so it cannot provide the actual SingleSpaceWidth value and returns 0 instead.


The font in question is named F5 and has this ToUnicode map:

/CIDInit /ProcSet findresource begin
14 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
4 beginbfchar
<0004> <0041>
<0012> <0043>
<001C> <0045>
<002F> <0049>
endbfchar
1 beginbfrange
<0044> <0045> <004D>
endbfrange
13 beginbfchar
<0102> <0061>
<0110> <0063>
<011A> <0064>
<011E> <0065>
<0150> <0067>
<015D> <0069>
<016F> <006C>
<0176> <006E>
<017D> <006F>
<0189> <0070>
<018C> <0072>
<0190> <0073>
<019A> <0074>
endbfchar
5 beginbfrange
<01C0> <01C1> <0076>
<01C6> <01C7> <0078>
<0359> <0359> [<2026>]
<035A> <035B> <2018>
<035E> <035F> <201C>
endbfrange
1 beginbfchar
<0374> <2013>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end

As you can see, there is no mapping to <0020>.
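
To run the same check on other fonts, here is a minimal, library-independent sketch in Python. It is not iText API: the function names and the shortened sample CMap are made up for illustration. It collects the Unicode targets of the bfchar and bfrange entries of a ToUnicode CMap and tests whether a space is among them. It assumes well-formed UTF-16BE hex strings and does not follow usecmap references.

```python
import re

HEX = r"<([0-9A-Fa-f]+)>"

def _utf16(hex_digits):
    """Decode a CMap hex string such as '0041' as UTF-16BE text."""
    return bytes.fromhex(hex_digits).decode("utf-16-be")

def tounicode_targets(cmap_text):
    """Return the set of Unicode strings that the CMap maps glyph codes to."""
    targets = set()
    # bfchar entries have the form: <code> <unicode>
    for body in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for _code, dst in re.findall(HEX + r"\s*" + HEX, body):
            targets.add(_utf16(dst))
    # bfrange entries: <lo> <hi> <unicode of lo>  or  <lo> <hi> [<u1> <u2> ...]
    entry = HEX + r"\s*" + HEX + r"\s*(\[[^\]]*\]|<[0-9A-Fa-f]+>)"
    for body in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(entry, body):
            if dst.startswith("["):
                # explicit list: one target per code in the range
                targets.update(_utf16(h) for h in re.findall(HEX, dst))
            else:
                # single start value: consecutive codes map to consecutive values
                first = _utf16(dst[1:-1])
                for offset in range(int(hi, 16) - int(lo, 16) + 1):
                    targets.add(first[:-1] + chr(ord(first[-1]) + offset))
    return targets

# Shortened excerpt of the F5 map quoted above (hypothetical test input).
SAMPLE = """
4 beginbfchar
<0004> <0041>
<0012> <0043>
<001C> <0045>
<002F> <0049>
endbfchar
1 beginbfrange
<0044> <0045> <004D>
endbfrange
3 beginbfrange
<0359> <0359> [<2026>]
<035A> <035B> <2018>
<035E> <035F> <201C>
endbfrange
"""

targets = tounicode_targets(SAMPLE)
print(" " in targets)  # False: nothing in the excerpt maps to U+0020
```

The check only says whether the map mentions U+0020; how a particular iText version reacts to that is described in the text above, not derived from this sketch.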


The use of fonts in this PDF page is quite funny, by the way:

Its body is (mostly) drawn using Calibri, but it uses two distinct PDF font objects for this, F4 which uses WinAnsiEncoding from character 32 through 122, i.e. including the space glyph, and F5 which uses Identity-H and provides the above quoted ToUnicode map without a space glyph. Each maximal sequence of glyphs without gap is drawn separately; if that whole sequence can be drawn using F4, that font is used, otherwise F5 is used.

Thus, CMI, (Credit, and sub-indexes are drawn using F4 while I’ve, “Credit, and Extended” are drawn using F5.

In your problem string “Credit Extended”, therefore, we see two consecutive sequences drawn using F5. Thus, you'll get a 0 SingleSpaceWidth both for the t at the end of “Credit and for the E at the start of Extended”.

At first glance these are the only two consecutive sequences using F5, so you have that issue only there.
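
The font choice described above can be mimicked in a few lines. This is only a sketch of the observed pattern in Python, not of how the PDF producer actually works: pick_font is a hypothetical helper, and it assumes that for the characters involved the WinAnsiEncoding code equals the Unicode code point, which holds for the ASCII range.

```python
def pick_font(run):
    """Return 'F4' if every character fits F4's code range 32..122, else 'F5'.

    Assumption: the WinAnsiEncoding code of each character equals its
    Unicode code point, which is true for the ASCII characters used here.
    """
    return "F4" if all(32 <= ord(ch) <= 122 for ch in run) else "F5"

# The sequences named in the answer; curly quotes are written as escapes.
runs = ["CMI,", "(Credit", "sub-indexes",
        "I\u2019ve", "\u201cCredit", "Extended\u201d"]
fonts = [pick_font(run) for run in runs]

# The problem string consists of two neighbouring runs that both need F5.
problem_pair = ("\u201cCredit", "Extended\u201d")
both_f5 = all(pick_font(run) == "F5" for run in problem_pair)

for run, font in zip(runs, fonts):
    print(font, run)
```

The curly apostrophe and quotation marks lie outside the range 32 through 122, which is why the runs containing them end up in F5, whose ToUnicode map has no space entry.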


As a consequence, you should develop a fallback strategy for the case of two consecutive characters both coming with a 0 SingleSpaceWidth, e.g. using something like a third of the font size.
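
One way to read that suggestion, sketched in Python rather than iTextSharp code: when neither neighbouring character reports a usable space width, substitute a third of the font size, then compare the horizontal gap against a fraction of that width. The function names and the factor of 0.5 are assumptions for illustration and are not taken from iText or from the linked question; all values are expected in the same units.

```python
def effective_space_width(prev_width, next_width, font_size):
    """Pick a space width for the gap between two neighbouring characters.

    prev_width / next_width: the single-space widths reported for the two
    characters (0 when the font has no recognisable space glyph).
    """
    known = [w for w in (prev_width, next_width) if w > 0]
    if known:
        # At least one side reported a real width: average what is available.
        return sum(known) / len(known)
    # Fallback from the answer: roughly a third of the font size.
    return font_size / 3.0

def is_word_break(gap, prev_width, next_width, font_size, factor=0.5):
    """Treat the gap as a word boundary if it exceeds factor * space width.

    factor=0.5 is an assumed threshold; tune it against real documents.
    """
    width = effective_space_width(prev_width, next_width, font_size)
    return gap > factor * width

# Both characters report 0, as for the t and the E in the problem string.
print(is_word_break(gap=2.5, prev_width=0, next_width=0, font_size=11))
print(is_word_break(gap=0.3, prev_width=0, next_width=0, font_size=11))
```

In an extraction strategy, the same comparison would sit at the point where two adjacent chunks are either merged into one word or kept apart.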

mkl
  • Thanks for the detailed answer. Just one more question: if I were to read the same document using the latest version of iText (version 7, I believe), would I still need to implement the logic you have suggested? – Sau001 Jul 05 '19 at 05:03
  • I would *assume* so but I'm not sure. I'd have to test it. – mkl Jul 05 '19 at 08:00