PDF documents don't require space characters to be present in page content streams to visually separate words. As a consequence, the font program may also lack a glyph for the space character. PDF viewers appear to use font metrics and the text state to infer an appropriate word spacing width, and check it against character positions to add the missing spaces when text is selected or copied. Unfortunately, the PDF specification does not seem to explain how the word spacing width should be computed in such cases. While pdf.js appears to hard-code a size for tracking word breaks, my empirical tests suggest that Acrobat Reader/Pro uses a different approach. What could such a heuristic be?

ceztko
  • You are asking to know what the internal logic of Acrobat is? Why is knowing how Acrobat does it important for you? If you got that info what would you do with it? – Ryan Aug 13 '22 at 18:05
  • Yes, or an alternative logic that is better than hard-coding a fixed value for all fonts, as done in pdf.js. I would use it in a PDF manipulation library. Acrobat is the PDF reference implementation, so I am assuming its heuristics tend to be normative. – ceztko Aug 13 '22 at 18:56
  • For sure not normative. But actually quite good. But these heuristics are implemented in their proprietary code... – mkl Aug 13 '22 at 22:56
  • If not "normative" at least "trusted", in the sense that other implementations will tend to follow Acrobat. Of course the exact heuristic used in Acrobat is not publicly available, but it could be discovered, or we could find something similar. I tried some approaches, like taking half of the smallest metric in the font program (or in the /W array), or considering side bearings, but they didn't work with some test cases. Maybe the spacing is a fraction of the average glyph width: it's a simple approach, but I haven't tried it so far and asked the experts first :) – ceztko Aug 14 '22 at 07:07
  • I *think* (I definitely don't *know*) that it's not that simple. I consider it most likely that there is some mixed strategy that takes multiple aspects into account, and differently in different documents depending on a recognized type of typesetting strategy in the stream. Over the years they surely collected a large corpus of documents to improve such a diversified strategy with. – mkl Aug 14 '22 at 10:13
  • @mkl what you say is certainly true for the whole text extraction algorithm, which is surely complex. Considering only horizontal scripts and excluding some peculiar features (e.g. combining marks), I believe an accurate strategy to break inline words is not terribly hard and mostly involves determining the spacing width and tracking glyph positions. The question here is how to better infer the spacing width. Because I don't expect anyone to do the research for me, I leave the question open for the future: at some point someone may come up with a better strategy than the pdf.js hard-coding. – ceztko Aug 15 '22 at 18:14
  • @mkl I attempted to [answer](https://stackoverflow.com/a/73420359/213871) my own question. The question is very technical and I probably shouldn't have asked it here on SO, but I hoped you had some clues, since I consider you the most expert/helpful guy here on SO with regard to PDF specification inquiries :) – ceztko Aug 19 '22 at 17:37
  • Thanks. But I usually want to not have to try and reverse engineer Adobe Acrobat. In contrast to the time before ISO 32000-1, that software does not define the PDF standard anymore, it merely is one possible implementation of it. Consequentially one should not have to mimic its behavior but instead look at the standard and based thereupon have one's own ideas. – mkl Aug 19 '22 at 21:41
  • @mkl I wanted to point out that the text extraction [code](https://github.com/podofo/podofo/blob/master/src/podofo/base/PdfPage_TextExtraction.cpp) I was mentioning is now part of the PoDoFo library. It clearly misses some special script features, and the heuristics are insufficient to handle text continuity in all cases, but it's a good starting point and already has some advanced features checked in. – ceztko Jan 07 '23 at 12:26

1 Answer

The question is very technical, and answering it requires either insider knowledge of Adobe Acrobat internals or experience implementing PDF text extraction with a robust set of test cases compared against Adobe's results. For anyone interested: assuming a robust word-break algorithm for text extraction can be implemented by inferring a spacing width and comparing it against glyph positions, the heuristic I'm currently testing is the following:

unscaledSpacingWidth = (average of non zero glyph widths obtained from /W or /Widths arrays) / 7

Here 7 is an arbitrary constant that seems to work well and matches Adobe Acrobat results closely enough in the limited set of samples I tested. By comparison, the solution in pdf.js just picks a hard-coded value of 0.1 PDF points.
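
As a rough illustration, the formula above could be sketched in Python like this. The function name and the plain list of widths are assumptions made for the example: a real implementation would first have to collect the widths from the font's /Widths array, or expand the /W array of a CID font.

```python
from statistics import mean


def unscaled_spacing_width(widths, divisor=7.0):
    """Estimate an unscaled spacing width from a font's glyph widths.

    `widths` is a flat list of glyph widths as read from /Widths or /W
    (hypothetical input format). `divisor` is the arbitrary constant from
    the heuristic; 7 is only an empirical choice, not a normative value.
    """
    # Ignore zero-width entries (missing or non-advancing glyphs).
    positive = [w for w in widths if w > 0]
    if not positive:
        # No usable metrics: the caller needs some other fallback.
        return 0.0
    return mean(positive) / divisor
```

The result is in the same units as the input widths, so it still has to be scaled before it can be compared with positions on the page.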

The resulting spacing width is then scaled according to the font size and the rest of the text state.
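
A minimal sketch of the scaling and of the comparison against glyph positions, under simplifying assumptions: widths are expressed in thousandths of a text space unit (as for non-Type 3 fonts), only font size and horizontal scaling are applied, and character spacing, word spacing, and the text and transformation matrices are ignored. The function names and the `(char, x_start, x_end)` tuples are hypothetical, chosen only for this example.

```python
def scaled_spacing_width(unscaled, font_size, horizontal_scaling=100.0):
    """Scale an unscaled spacing width (1/1000 units) by the text state.

    Only font size and horizontal scaling (a percentage) are considered.
    """
    return unscaled / 1000.0 * font_size * (horizontal_scaling / 100.0)


def insert_spaces(glyphs, spacing_width):
    """Rebuild a line of text, adding a space where the gap is wide enough.

    `glyphs` is a list of (char, x_start, x_end) tuples for one horizontal
    line, already sorted in reading order.
    """
    out = []
    prev_end = None
    for ch, x_start, x_end in glyphs:
        if prev_end is not None:
            gap = x_start - prev_end
            # Add a space only if none is already present on either side.
            if gap >= spacing_width and out[-1] != " " and ch != " ":
                out.append(" ")
        out.append(ch)
        prev_end = x_end
    return "".join(out)


# Average width 600 with the /7 heuristic, at a 12 pt font size.
threshold = scaled_spacing_width(600 / 7, font_size=12)
line = [("a", 0.0, 5.0), ("b", 5.2, 10.0), ("c", 13.0, 18.0)]
print(insert_spaces(line, threshold))
```

Here the gap between "a" and "b" is below the threshold while the gap between "b" and "c" is above it, so a space is inserted only before "c".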

ceztko