I have a series of scientific/technical documents, extracted from PDF, in which the characters are drawn as vector graphics rather than encoded as text in a font. How can I convert the vector stream back to characters using open-source tools?
I would welcome any accounts of successful solutions. These might include:
- machine learning to discover the original font family
- writing the stream to a canvas and using OCR
- heuristics based on reconstructing the characters from the strokes
The characters are probably fairly "simple" (many are sans-serif), and I'd be happy with reconstruction into plain ASCII (chars 32-127).
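To make the canvas/OCR option concrete, here is a minimal sketch of the rasterising step only, in pure Python. It assumes the glyph outlines have already been flattened into closed polygons (any Bézier segments approximated by short lines), and it fills them with the even-odd rule so that inner contours become holes. The function names `point_in_contours` and `rasterize` are my own, not from any library, and the OCR engine that would consume the bitmap is not shown.

```python
def point_in_contours(x, y, contours):
    """Even-odd test: cast a ray in +x and toggle on each edge crossing."""
    inside = False
    for contour in contours:
        n = len(contour)
        for i in range(n):
            x1, y1 = contour[i]
            x2, y2 = contour[(i + 1) % n]  # wrap around to close the contour
            # Does this edge straddle the horizontal line through y?
            if (y1 > y) != (y2 > y):
                x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
                if x < x_cross:
                    inside = not inside
    return inside


def rasterize(contours, width, height):
    """Sample each pixel centre and return the bitmap as rows of '#' and '.'."""
    rows = []
    for row in range(height):
        y = row + 0.5
        rows.append("".join(
            "#" if point_in_contours(col + 0.5, y, contours) else "."
            for col in range(width)
        ))
    return rows


# Hypothetical "O"-like glyph: an outer square with a square hole.
outer = [(0, 0), (6, 0), (6, 6), (0, 6)]
inner = [(2, 2), (4, 2), (4, 4), (2, 4)]
bitmap = rasterize([outer, inner], 6, 6)
print("\n".join(bitmap))
```

A real pipeline would render at a much higher resolution and flip the y axis if the source coordinates run bottom-up, as PDF user space normally does.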
UPDATE: [for SO readers' info; does not affect bounty]. I have been extracting the vectors from a single example. Each glyph consists of a stroke outlining its shape, so that even simple glyphs such as "I" are "hollow". I suspect this is true of most vector fonts. I have verified that multiple instances of the same character have identical internal coordinates. This could be used for lookup and for discriminating between fonts, since the minuscule differences between fonts will show up in the decimal places. If the fonts scale precisely, and if we have the coordinates of the fonts (copyright allowing), then lookup by internal coordinates is a powerful approach. I'd be interested to hear whether anyone has tried this.
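A minimal sketch of the coordinate-lookup idea, under assumptions I have not verified beyond my single example: every instance of a glyph lists its points in the same order, and instances differ only by translation and uniform scaling. Each outline is shifted to its bounding-box origin, divided by its larger bounding-box dimension, and rounded, which gives a hashable signature for a dictionary lookup. The names `glyph_signature` and `known_glyphs`, and the sample coordinates, are made up for illustration.

```python
def glyph_signature(points, decimals=3):
    """Translation- and scale-independent key for an outline's point list."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    min_x, min_y = min(xs), min(ys)
    # Normalise by the larger bounding-box dimension (assumes uniform scaling).
    scale = max(max(xs) - min_x, max(ys) - min_y)
    if scale == 0:
        raise ValueError("degenerate outline")
    return tuple(
        (round((x - min_x) / scale, decimals), round((y - min_y) / scale, decimals))
        for x, y in points
    )


# Hypothetical reference outlines, e.g. taken from glyphs already identified by hand.
reference = {
    "I": [(0, 0), (1, 0), (1, 5), (0, 5)],
    "L": [(0, 0), (3, 0), (3, 1), (1, 1), (1, 5), (0, 5)],
}
known_glyphs = {glyph_signature(pts): ch for ch, pts in reference.items()}

# The same "I" found elsewhere on the page, scaled by 2 and shifted by (10, 20).
instance = [(10, 20), (12, 20), (12, 30), (10, 30)]
print(known_glyphs.get(glyph_signature(instance)))        # I
print(known_glyphs.get(glyph_signature([(0, 0), (1, 0), (0, 1)])))  # None
```

Two caveats with this sketch: normalising each glyph by its own bounding box discards relative size, so shapes that differ only in size (say "o" and "O" in some fonts) would collide unless the height is stored separately, and rounding can split values that fall right on a rounding boundary, so a tolerance-based comparison may be more robust than exact dictionary keys.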