
For some reason iTextSharp is now reading a PDF which contains numbers such as 4123 as 4*23, where the * is actually an arrow pointing up. I am not sure why this is happening. Please help.

Thanks.

Sample file is located here: https://dl.dropboxusercontent.com/u/116833/SAMPLE%20PDF.pdf

Steven Marcus

1 Answer


The arrows appear because the file actually tries to mislead text extractors that extract text according to the guidelines of Section 9.10.2, "Mapping Character Codes to Unicode Values", of the PDF specification ISO 32000-1, while not confusing those that prefer ActualText marked-content sequence entries: the former method is led to believe the '3's are arrows, while the latter is told the '3's are threes.

Most likely this is done to prevent automated text extraction while still allowing manual copy and paste: Adobe Reader prefers the ActualText marked-content sequence entries (so manual extraction works all right), while many programmatic extractors prefer the former method.

As far as I can tell from the relevant sections of the specification, it prefers neither way over the other.

Details

For example, look at the content stream operations for the first part number:

BT
/T1_1 1 Tf
10 0 0 10 69.1456 750.2834 Tm
(1 )Tj
ET
EMC 
/Span <</MCID 14 >>BDC 
BT
/T1_1 1 Tf
10 0 0 10 89.5488 750.2834 Tm
(2)Tj
/Span<</ActualText<FEFF0033>>> BDC 
(3)Tj
EMC 
(412109 )Tj
ET
EMC 

As you can see, the '3' is marked with an ActualText entry indicating that it is indeed a three (<FEFF0033> is the long way to write the Unicode digit three: a UTF-16BE byte order mark FEFF followed by the code unit 0033).
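To make the decoding of that entry concrete, here is a minimal Python sketch. It is not part of iTextSharp; the function name `decode_pdf_text_string` is made up for illustration, and the fallback branch for strings without a byte order mark is only an approximation of PDFDocEncoding.

```python
import codecs


def decode_pdf_text_string(hex_string: str) -> str:
    """Decode a PDF hex string such as '<FEFF0033>' into text (hypothetical helper)."""
    raw = bytes.fromhex(hex_string.strip("<>"))
    if raw.startswith(codecs.BOM_UTF16_BE):
        # A leading FE FF byte order mark signals UTF-16BE text.
        return raw[2:].decode("utf-16-be")
    # Assumption: without the mark the string is PDFDocEncoding, which is
    # approximated here by Latin-1 (the two differ for some code points).
    return raw.decode("latin-1")


actual_text = decode_pdf_text_string("<FEFF0033>")
print(repr(actual_text))
```

So an extractor that honours ActualText sees an ordinary digit three at this position.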

The font T1_1, on the other hand, offers a ToUnicode stream containing the mapping

...
<30> <0030>
<31> <0031>
<32> <0032>
<33> <0018>
<34> <0034>
<35> <0035>
...

As you can see, the other digits are mapped to themselves (0x30 is '0', 0x31 is '1', ... , 0x39 is '9'), but the '3', i.e. 0x33, is mapped to the Unicode code point 0x0018, and

U+0018 is the Unicode hex value of the character <control>, which is categorized as "control character" in the Unicode 6.0 character table.

"<control>" was previously named "CANCEL" in older versions of Unicode.

(cf. http://www.marathon-studios.com/unicode/U0018/Control)

In some contexts this control character is displayed as an upwards arrow; for example, the glyph at position 0x18 in the classic IBM PC code page 437 character set is ↑.
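The effect of that mapping on the part number can be reproduced with a short Python sketch. The table below rebuilds only what the answer states (digits map to themselves, except 0x33, which maps to U+0018); the names `to_unicode` and `map_codes` are illustrative and not taken from any library.

```python
import unicodedata

# Digits 0x30..0x39 map to themselves, as described in the answer ...
to_unicode = {code: code for code in range(0x30, 0x3A)}
# ... except the character code for '3', which the font maps to U+0018.
to_unicode[0x33] = 0x0018


def map_codes(codes: bytes) -> str:
    """Translate character codes to text the way a ToUnicode-based extractor would."""
    return "".join(chr(to_unicode[code]) for code in codes)


extracted = map_codes(b"23412109")
print(repr(extracted))
# General category of the character that replaced the '3'
print(unicodedata.category(extracted[1]))
```

The second character of the result is a control character rather than a digit, which is what a ToUnicode-based extractor then hands to whatever displays it.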

mkl
  • Thanks for the long explanation. Do you have a suggestion on how to extract the 23412109 as 23412109 instead of having symbols in it. – Steven Marcus Mar 27 '14 at 20:05
  • Create a new text extraction strategy which considers marked content. If I remember correctly strategies have some access to marked content properties. If this access suffices for your task, it should be only a medium problem to use the **ActualText** entries. – mkl Mar 27 '14 at 20:44
  • How do I do that? I've tried a few different strategies. Currently I am using: `strategy = new FilteredTextRenderListener(new SimpleTextExtractionStrategy(), filter);` – Steven Marcus Mar 28 '14 at 12:07
  • You will actually have to program one, e.g. by copying the source of an existing one and changing it. The `TextRenderInfo` object a strategy gets as information concerning a piece of text has some marked content information which is partially private, though. You, therefore, might need to trick a bit. – mkl Mar 28 '14 at 14:01
  • Any suggestions? Not sure how to do that – Steven Marcus Mar 28 '14 at 19:02
  • I'd have to experiment before being able to say anything more specific. As mentioned in my prior comment, `TextRenderInfo` carries some marked content information, but the details of using them require some time. – mkl Apr 01 '14 at 08:16
  • Related: [PDFBOX-4532](https://issues.apache.org/jira/browse/PDFBOX-4532). – mkl May 06 '19 at 14:36
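
The idea from the comments, an extraction strategy that considers marked content, can be illustrated independently of iTextSharp. The following Python sketch is not an iTextSharp strategy and uses no iTextSharp API; it is a toy interpreter for the content stream excerpt quoted in the answer. The regular-expression tokenizer, the function name `extract`, and the identity fallback for codes not listed in the mapping (such as the space) are simplifying assumptions; a real implementation would also need to handle escaped parentheses, `TJ` arrays, and ActualText strings without a byte order mark.

```python
import re

# Simplified tokenizer: property-list BDC, simple (text)Tj, and EMC only.
TOKEN = re.compile(
    r"<<(?P<props>.*?)>>\s*BDC"
    r"|\((?P<text>[^)]*)\)\s*Tj"
    r"|(?P<emc>\bEMC\b)"
)
ACTUAL_TEXT = re.compile(r"/ActualText\s*<([0-9A-Fa-f]+)>")

# ToUnicode mapping as described in the answer: digits are identity, '3' is U+0018.
TO_UNICODE = {code: code for code in range(0x30, 0x3A)}
TO_UNICODE[0x33] = 0x0018

STREAM = """/Span <</MCID 14 >>BDC
BT
/T1_1 1 Tf
10 0 0 10 89.5488 750.2834 Tm
(2)Tj
/Span<</ActualText<FEFF0033>>> BDC
(3)Tj
EMC
(412109 )Tj
ET
EMC
"""


def extract(stream: str, prefer_actual_text: bool = True) -> str:
    out = []
    stack = []  # one entry per open marked-content sequence: its ActualText or None
    for match in TOKEN.finditer(stream):
        if match.group("props") is not None:
            found = ACTUAL_TEXT.search(match.group("props"))
            replacement = None
            if found and prefer_actual_text:
                # Assumption: the value is UTF-16BE with a leading FEFF mark.
                replacement = bytes.fromhex(found.group(1))[2:].decode("utf-16-be")
                if all(entry is None for entry in stack):
                    out.append(replacement)
            stack.append(replacement)
        elif match.group("text") is not None:
            if any(entry is not None for entry in stack):
                continue  # shown text is replaced by the enclosing ActualText
            # Assumption: codes missing from the mapping fall back to identity.
            out.append("".join(chr(TO_UNICODE.get(ord(c), ord(c)))
                               for c in match.group("text")))
        elif stack:
            stack.pop()  # EMC closes the innermost marked-content sequence
    return "".join(out)


print(repr(extract(STREAM)))
print(repr(extract(STREAM, prefer_actual_text=False)))
```

With `prefer_actual_text=True` the '3' comes from the ActualText entry; with `False` the sketch behaves like a purely ToUnicode-based extractor and yields the control character instead.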