1

is it possible to determine if a number in TJ operator represents space between words?

Example: [(Sta)28(ry)-333(Plzenec,)]TJ

Number 28 is not enough for space, otherwise 333 it should be space according to actual font size. Font size is 9.96

1 Answers1

4

First of all please be aware that there is no absolute limit number separating numbers for spaces between words from spaces for kerning. All you can do is develop heuristics which will fail for some documents, usually for very tightly set ones.

Now remember how those numbers are applied when calculating the text replacement tx or ty from the origin of the last character before the number to the origin of the first character thereafter:

screen shot from ISO 32000-1, 9.4.4 Text Space Details

(ISO 32000-1, section 9.4.4 Text Space Details, also discussed here)

Thus, first of all such a number only widens the gap to the next character if it's negative.

Furthermore, the number is applied before the font size is multiplied; thus, one does not have to take the font size into account as I incorrectly claimed in a comment to the question.

The number (scaled by 1/1000) is directly subtracted from the glyphs displacement. So one can compare it with the glyph displacements of the font in question to get an impression of the meaning of the number.

The glyph displacements essentially are the numbers from the corresponding font's Widths or W array (defaulting to the MissingWidth / DW value) scaled by 1/1000. As both the TJ numbers and the widths are scaled by 1/1000, you can directly compare them.

Thus, an obvious option would be to compare the absolute value of negative TJ numbers to the width of the space glyph in the font in question. This differs from font to font, e.g. it's 600 for Courier, 278 for Helvetica, and 250 for Times-Roman.

Spaces between words created by TJ numbers don't necessarily have to be as wide as the full space glyph of the font, but a relevant fraction of it, e.g. half its value (YMMV), can be used as minimum for interpreting a TJ number as a space between words.

Unfortunately, though, if a PDF generator creates all spaces between words by TJ numbers and none by space glyphs, and if the font is embedded as a subset only, there is no need to embed the space glyph at all. In that case you might want to use other glyphs to compare with; often the length of a capital 'M' is used as a measure for the widths of a font, you might want to use a relevant fraction thereof, e.g. one fifth (YMMV again).


You can improve your heuristics

  • by also taking the character spacing value Tc into account: If Tc / Tfs is negative with a relevant absolute value, the text is tightly set. In that case you might want to lessen the limit number determined as above. Or
  • by an analysis of all the TJ numbers in your text or those in the surrounding text. Here I can only guess, though, what might be acceptable heuristics...
mkl
  • 90,588
  • 15
  • 125
  • 265