
I have a series of ex-PDF documents (scientific/technical) with characters encoded as vector graphics rather than in a font family. How do I convert the vector stream to characters using Open Source solutions?

I would welcome any accounts of successful solutions. These might include:

  • machine learning to discover the original font family
  • writing the stream to a canvas and using OCR
  • heuristics based on reconstructing the characters from the strokes

The characters are probably fairly "simple" (many are sans-serif) and I'd be happy with reconstruction into printable ASCII (chars 32-127)

UPDATE: [for SO readers' info; does not affect bounty]. I have been extracting the vectors from a single example; they consist of a stroke outlining the glyph, so that even simple glyphs such as "I" are "hollow". I suspect this is true of most vector fonts. I have verified that multiple instances of the same character have identical internal coordinates, which could be used for lookup and for discrimination between fonts (the minuscule differences will show up in the decimal places). If the fonts scale precisely, and if we have the coordinates of the glyphs (copyright allowing), then lookup of their internal coordinates is a powerful approach. I'd be interested to hear whether anyone has tried this.
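To make the lookup idea concrete, here is a minimal Java sketch, assuming the outline has already been extracted from the content stream as a list of (x, y) points; the unit-box normalisation and the three-decimal rounding are illustrative choices, not a standard.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Locale;
    import java.util.Map;

    // Sketch: build a scale- and translation-invariant key from a glyph
    // outline and look it up in a table built from fonts whose outlines
    // (and hence characters) are already known.
    public class GlyphLookup {

        private final Map<String, Character> known = new HashMap<String, Character>();

        /** Register an outline from a known font as producing character c. */
        public void learn(List<double[]> outline, char c) {
            known.put(keyFor(outline), c);
        }

        /** Return the character for an outline, or null if never seen. */
        public Character recognise(List<double[]> outline) {
            return known.get(keyFor(outline));
        }

        /** Normalise the outline into a unit box and round to 3 decimals. */
        static String keyFor(List<double[]> outline) {
            double minX = Double.MAX_VALUE, minY = Double.MAX_VALUE;
            double maxX = -Double.MAX_VALUE, maxY = -Double.MAX_VALUE;
            for (double[] p : outline) {
                minX = Math.min(minX, p[0]);
                minY = Math.min(minY, p[1]);
                maxX = Math.max(maxX, p[0]);
                maxY = Math.max(maxY, p[1]);
            }
            double scale = Math.max(maxX - minX, maxY - minY); // zero only for degenerate glyphs
            StringBuilder key = new StringBuilder();
            for (double[] p : outline) {
                key.append(String.format(Locale.ROOT, "%.3f,%.3f;",
                        (p[0] - minX) / scale, (p[1] - minY) / scale));
            }
            return key.toString();
        }
    }

Normalising before matching keeps the table usable across font sizes, while the rounding absorbs floating-point noise but still leaves decimal places with which to discriminate between fonts.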

peter.murray.rust
  • 37,407
  • 44
  • 153
  • 217

2 Answers


Your question already identifies the most successful and well-known approaches to converting vector encodings into characters when the formatting and font families are unknown. All you lack, and all you're asking for, is a solution that re-encodes the stream at an arbitrary (but ideally high) level of quality.

Let's explore each of your candidate approaches in turn, along with their possibilities:

  1. machine learning to discover the original font family

    This paper discusses the topic in more detail. The most common techniques (reference) are to train a simple support vector machine or to perform Bayesian inference to classify each character; a sketch of the SVM route follows this list.

    These techniques appear most often in spam detection, where the complete body of an email is visually inspected for, for example, ASCII art or spam encoded as image content. Vectorized classification for document reading is much less common beyond that initial pass.

  2. writing the stream to a canvas and using OCR

    This is the most common technique, and the one with the widest software support, because the most common use case is a scanned physical document passed in for visual inspection. It discards the vector paths rather than classifying them, relying instead on recognising the glyphs as rendered on the page; a rendering-plus-OCR sketch also follows this list.

    Several free solutions exist here, including OCR 4 Linux and the now-free tesseract-ocr. For a more complete list, including feature comparisons, see here.

  3. heuristics based on reconstructing the characters from the strokes

    For the most part, these are derived from machine-learning techniques and baked into OCR or handwriting-recognition software. Because recognising characters in an arbitrary stream is an inductive classification problem, such heuristics are usually limited to the specific language that backs them.

    This technique certainly exists. It's currently in use by tools like Evernote, which allows you to upload your documents for free (up to a point) and performs the vector analysis for you.
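To make (1) concrete, here is a minimal sketch using the libsvm Java bindings (Java, since you mention it in the comments). It assumes each glyph has already been rasterised into a fixed-size binary grid; the 16x16 grid and the RBF parameters are illustrative guesses, not tuned values.

    import libsvm.*;

    // Sketch: train an SVM on rasterised glyphs (one label per character)
    // and classify a new glyph. Each glyph is a 16x16 binary grid
    // flattened into 256 features.
    public class GlyphSvm {

        static svm_node[] toNodes(double[] pixels) {
            svm_node[] nodes = new svm_node[pixels.length];
            for (int i = 0; i < pixels.length; i++) {
                nodes[i] = new svm_node();
                nodes[i].index = i + 1;      // libsvm feature indices are 1-based
                nodes[i].value = pixels[i];
            }
            return nodes;
        }

        /** Train on known glyphs; labels can be the ASCII code of each character. */
        public static svm_model train(double[][] glyphs, double[] labels) {
            svm_problem prob = new svm_problem();
            prob.l = glyphs.length;
            prob.y = labels;
            prob.x = new svm_node[glyphs.length][];
            for (int i = 0; i < glyphs.length; i++) {
                prob.x[i] = toNodes(glyphs[i]);
            }
            svm_parameter param = new svm_parameter();
            param.svm_type = svm_parameter.C_SVC;
            param.kernel_type = svm_parameter.RBF;
            param.C = 10;                    // illustrative, not tuned
            param.gamma = 1.0 / 256;         // 1 / number of features
            param.cache_size = 100;
            param.eps = 1e-3;
            return svm.svm_train(prob, param);
        }

        public static char classify(svm_model model, double[] pixels) {
            return (char) svm.svm_predict(model, toNodes(pixels));
        }
    }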
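For (2), a sketch of the render-then-OCR route, also in Java. It assumes the extracted outlines can be replayed as java.awt.geom paths; Tess4J is one Java wrapper around tesseract-ocr (the usage below follows its current API), and the page size and tessdata path are placeholders to adjust.

    import java.awt.Color;
    import java.awt.Graphics2D;
    import java.awt.geom.GeneralPath;
    import java.awt.image.BufferedImage;
    import java.util.List;
    import net.sourceforge.tess4j.Tesseract;
    import net.sourceforge.tess4j.TesseractException;

    // Sketch: replay the extracted glyph outlines onto a canvas, then hand
    // the bitmap to Tesseract. Filling the outlines turns the "hollow"
    // stroked glyphs back into the solid shapes an OCR engine expects.
    public class RenderAndOcr {

        public static String ocr(List<GeneralPath> glyphOutlines) throws TesseractException {
            BufferedImage page = new BufferedImage(1200, 1600, BufferedImage.TYPE_BYTE_GRAY);
            Graphics2D g = page.createGraphics();
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, page.getWidth(), page.getHeight());
            g.setColor(Color.BLACK);
            // Note: PDF's y-axis points up and Java2D's points down, so a
            // real implementation would also apply a flip transform here.
            for (GeneralPath outline : glyphOutlines) {
                g.fill(outline);
            }
            g.dispose();

            Tesseract tesseract = new Tesseract();
            tesseract.setDatapath("/usr/share/tessdata"); // placeholder path
            return tesseract.doOCR(page);
        }
    }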

Given how time-consuming the first approach is, and that you already know the language and probably the set of font families, I recommend pursuing (2) and (3) as your first ports of call. The easiest experiment is to open a free Evernote account and upload the documents, purely to see what gets captured.

Best of luck to you. If the current state of the art is insufficient, you may have a useful corner case worth contributing to the field. :)

MrGomez
  • 23,788
  • 45
  • 72
  • 1
    Very useful overview and references. My implementation may be influenced by the ease/difficulty of integrating OCR (I'd like it to be in Java if possible). – peter.murray.rust Apr 06 '12 at 23:46
  • @peter.murray.rust Glad to help. You're probably already exploring it, but according to [this thread](http://stackoverflow.com/questions/1813881/java-ocr-implementation), [Java OCR](http://sourceforge.net/projects/javaocr/) seems like the right place to start. Good luck! :) – MrGomez Apr 06 '12 at 23:48
  • I hadn't come across JavaOCR - this looks like a fantastic part of my toolkit – peter.murray.rust Apr 07 '12 at 07:17

Upload the documents to Google Docs. When prompted, make sure the "Upload settings" dialog option "Convert text from PDF and image files to Google documents" is checked. The Google Docs "Upload or download files" help page shows that OCR is performed for .jpg, .gif, .png, and .pdf file types. If it doesn't like your PDF format, try converting it to .png or .gif before uploading.

Note: Google's About Optical Character Recognition page mentions "For PDF files, we only look at the first 10 pages when searching for text to extract."

Brian Swift
  • 1,403
  • 8
  • 10