PDF and words
The Portable Document Format (PDF) does not know the concept of words, or at least it does not require textual content to be clearly arranged as words.
(There is one feature, word spacing, which only works if one uses a clearly identified space glyph to separate glyph groups which make up individual words, but this feature is not used that often.)
Thus, to recognize words in PDFs one indeed has to analyze the glyphs in them and their positions.
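To make the idea of recognizing words from glyph positions concrete, here is a minimal sketch in plain Java. It is not how PDFBox does it: the Glyph and WordGrouper classes, the gapFactor parameter and the 0.5 baseline tolerance are all assumptions chosen for illustration. It starts a new word whenever the baseline changes or the horizontal gap to the previous glyph is large relative to that glyph's width.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical glyph record: character, left x, width and baseline y. */
final class Glyph {
    final String text;
    final float x;
    final float width;
    final float y;

    Glyph(String text, float x, float width, float y) {
        this.text = text;
        this.x = x;
        this.width = width;
        this.y = y;
    }
}

final class WordGrouper {
    /**
     * Groups glyphs (assumed to be in reading order already) into words.
     * A new word starts when the baseline changes or when the gap to the
     * previous glyph exceeds gapFactor times the previous glyph's width.
     */
    static List<String> groupIntoWords(List<Glyph> glyphs, float gapFactor) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph previous = null;
        for (Glyph glyph : glyphs) {
            if (previous != null) {
                // Assumed tolerance: baselines closer than 0.5 units count as one line.
                boolean newLine = Math.abs(glyph.y - previous.y) > 0.5f;
                float gap = glyph.x - (previous.x + previous.width);
                boolean wideGap = gap > gapFactor * previous.width;
                if ((newLine || wideGap) && current.length() > 0) {
                    words.add(current.toString());
                    current.setLength(0);
                }
            }
            current.append(glyph.text);
            previous = glyph;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }
}
```

Real documents need more care than this (kerning, rotated text, glyphs drawn out of order), which is why a library with tuned heuristics is usually the better choice.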
PDFBox and words
The base classes of PDFTextStripper parse the content and report each rendered glyph separately via the processTextPosition method. The default implementation of that method then collects these individual glyph data, with some special treatment of glyphs at the same position.
When a whole page has been parsed, the collected data are arranged into lines (after sorting if SortByPosition is true). The lines are then broken into words according to a number of heuristics, and each word is forwarded to writeString, which writes it into a buffer whose content is eventually returned as the extracted text.
(This is somewhat simplified but should suffice for the question at hand.)
Thus, those two methods are the main places to override with your own code.
- One overrides processTextPosition if one
  - wants the glyph characters in the raw order of their appearance in the stream instead of sorted and arranged, or
  - needs to access and react to the state of the parsed stream at the moment the glyph is rendered.
- On the other hand, one overrides writeString if one is interested in the sorted and arranged glyph characters.
For some tasks one actually needs to override both, e.g. as in this answer.
Example for PDFBox & Java
A simple implementation in PDFBox & Java (in a comment the OP mentioned that this could help him, too) might look like this:
String extractWordLocations(PDDocument document) throws IOException
{
    PDFTextStripper stripper = new PDFTextStripper()
    {
        @Override
        protected void writeString(String text, List<TextPosition> textPositions) throws IOException
        {
            // Write the word itself first, as the default implementation does.
            super.writeString(text, textPositions);
            // Then append "[xstart - xend / y]", taken from the first and last glyph of the word.
            TextPosition firstPosition = textPositions.get(0);
            TextPosition lastPosition = textPositions.get(textPositions.size() - 1);
            writeString(String.format("[%s - %s / %s]",
                    firstPosition.getXDirAdj(),
                    lastPosition.getXDirAdj() + lastPosition.getWidthDirAdj(),
                    firstPosition.getYDirAdj()));
        }
    };
    // Sort the glyphs by position before they are arranged into lines and words.
    stripper.setSortByPosition(true);
    return stripper.getText(document);
}
(From ExtractText.java)
Applying it like this to your example file
try (   InputStream documentStream = getClass().getResourceAsStream("mathml88.pdf");
        PDDocument document = PDDocument.load(documentStream))
{
    String wordLocations = extractWordLocations(document);
    System.out.println("\n'mathml88.pdf', extract with word locations:");
    System.out.println(wordLocations);
    System.out.println("***********************************");
}
(ExtractText test method testExtractWordLocationsFromMathml88)
results in
88[74.34 - 85.2491 / 61.241028] Chapter[378.835 - 413.37317 / 61.241028] 3.[416.10043 - 424.28226 / 61.241028] Presentation[429.73682 - 483.67136 / 61.241028] Markup[486.39862 - 520.93677 / 61.241028]
3.4.3.3[74.34 - 104.34002 / 97.10602] Examples[120.70367 - 163.72914 / 97.10602]
The[74.34 - 91.30365 / 117.565] msubsup[93.55299 - 133.57849 / 117.565] is[135.816 - 143.09236 / 117.565] most[145.33963 - 166.55782 / 117.565] commonly[168.80508 - 215.47418 / 117.565] used[217.72145 - 237.71782 / 117.565] for[239.976 - 252.69601 / 117.565] adding[254.94328 - 284.63788 / 117.565] sub/superscript[286.88516 - 352.93976 / 117.565] pairs[355.18704 - 376.39432 / 117.565] to[378.6416 - 387.12888 / 117.565] identifiers[389.37616 - 433.0125 / 117.565] as[435.2598 - 444.34708 / 117.565] illustrated[446.60526 - 490.23068 / 117.565] above.[492.48886 - 520.9398 / 117.565]
However,[74.34 - 115.90368 / 131.11401] another[118.88187 - 151.59825 / 131.11401] important[154.56552 - 196.991 / 131.11401] use[199.96918 - 214.511 / 131.11401] is[217.48918 - 224.76555 / 131.11401] placing[227.73282 - 259.8492 / 131.11401] limits[262.8274 - 287.68918 / 131.11401] on[290.66736 - 301.57648 / 131.11401] certain[304.54376 - 334.22736 / 131.11401] large[337.20554 - 358.81644 / 131.11401] operators[361.78372 - 402.37646 / 131.11401] whose[405.35464 - 433.22742 / 131.11401] limits[436.2056 - 461.06738 / 131.11401] are[464.03467 - 477.35464 / 131.11401] tradition-[480.33282 - 520.93646 / 131.11401]
ally[74.34 - 90.70365 / 144.664] displayed[93.812744 - 135.62732 / 144.664] in[138.74731 - 147.23459 / 144.664] the[150.34369 - 163.6746 / 144.664] script[166.7837 - 191.02373 / 144.664] positions[194.14372 - 233.54736 / 144.664] even[236.65646 - 256.8165 / 144.664] when[259.9256 - 283.55472 / 144.664] rendered[286.6747 - 324.82382 / 144.664] in[327.94382 - 336.4311 / 144.664] display[339.5402 - 371.05658 / 144.664] style.[374.16568 - 397.5002 / 144.664] The[400.6202 - 417.58386 / 144.664] most[420.69296 - 441.91116 / 144.664] common[445.02026 - 483.20212 / 144.664] of[486.3221 - 495.4094 / 144.664] these[498.5185 - 520.93665 / 144.664]
is[74.34 - 81.61636 / 158.21301] the[84.343636 - 97.67456 / 158.21301] integral.[100.40183 - 136.29279 / 158.21301] For[139.02007 - 154.00917 / 158.21301] example,[156.73645 - 196.26012 / 158.21301]
?[120.703995 - 126.847725 / 193.42804] 1[131.73799 - 136.22119 / 196.88904]
ex[138.04799 - 147.33707 / 208.36603] dx[149.18999 - 160.47609 / 208.36603]
0[126.83699 - 131.32019 / 217.77405]
As you can see, an expression of the form "[xstart - xend / y]" is attached to each word.
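If you want to post-process this proof-of-concept output rather than change the stripper, the annotations can be read back with a regular expression. The following is only a sketch: WordLocationParser and ParsedWord are hypothetical names, and the pattern assumes exactly the "[%s - %s / %s]" layout produced by the String.format call in the answer's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical holder for one parsed "word[xstart - xend / y]" token. */
final class ParsedWord {
    final String text;
    final float xStart;
    final float xEnd;
    final float y;

    ParsedWord(String text, float xStart, float xEnd, float y) {
        this.text = text;
        this.xStart = xStart;
        this.xEnd = xEnd;
        this.y = y;
    }
}

final class WordLocationParser {
    // Word text (no whitespace), followed by "[xstart - xend / y]".
    private static final Pattern ANNOTATED_WORD = Pattern.compile(
            "(\\S+?)\\[([-+0-9.Ee]+) - ([-+0-9.Ee]+) / ([-+0-9.Ee]+)\\]");

    /** Parses every annotated word found in one line of the extracted text. */
    static List<ParsedWord> parseLine(String line) {
        List<ParsedWord> result = new ArrayList<>();
        Matcher matcher = ANNOTATED_WORD.matcher(line);
        while (matcher.find()) {
            result.add(new ParsedWord(
                    matcher.group(1),
                    Float.parseFloat(matcher.group(2)),
                    Float.parseFloat(matcher.group(3)),
                    Float.parseFloat(matcher.group(4))));
        }
        return result;
    }
}
```

Parsing text back like this is fragile (a word that itself contains such a bracket expression would confuse it), so it is only suitable for quick experiments.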
Putting all the information into a String is for proof-of-concept purposes only. For production use you may instead want to create a WordWithPosition class, create an instance of that class for each word in writeString, and store those objects in a List whose content you eventually retrieve from your PDFTextStripper extension.
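A WordWithPosition class along those lines might be sketched as follows. Only the class name comes from the text above; its fields, the WordCollector helper and the method names are assumptions made for illustration, and the sketch is kept free of PDFBox types so that it compiles on its own.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** One extracted word with its horizontal extent and baseline (fields are assumed). */
final class WordWithPosition {
    final String text;
    final float xStart;
    final float xEnd;
    final float y;

    WordWithPosition(String text, float xStart, float xEnd, float y) {
        this.text = text;
        this.xStart = xStart;
        this.xEnd = xEnd;
        this.y = y;
    }
}

/** Hypothetical helper that a PDFTextStripper extension could hold as a field. */
final class WordCollector {
    private final List<WordWithPosition> words = new ArrayList<>();

    /** Records one word; intended to be called once per writeString invocation. */
    void add(String text, float xStart, float xEnd, float y) {
        words.add(new WordWithPosition(text, xStart, xEnd, y));
    }

    /** Returns a read-only view of the words collected so far, in insertion order. */
    List<WordWithPosition> getWords() {
        return Collections.unmodifiableList(words);
    }
}
```

In an overridden writeString one would then presumably call something like collector.add(text, firstPosition.getXDirAdj(), lastPosition.getXDirAdj() + lastPosition.getWidthDirAdj(), firstPosition.getYDirAdj()) instead of appending the formatted string, and read collector.getWords() after getText has returned.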