3

I have a TIFF file that contains text in columns separated by tabs (4 spaces). But when I extract the text from this TIFF image, I always get a single space between two columns. An example:

TIFF IMAGE:
col-a    col-b    col-c

desired output:
col-a    col-b    col-c

but I am getting the following:
col-a col-b col-c

I tried this with multiple images of the same format, but the result is always the same. How do I fix this issue? Can I train Tesseract to understand this?

user2531191

2 Answers

9

After a long search I found the solution. Here are the steps to follow:

  1. Upgrade Tesseract to 3.04.

  2. Create config.txt in the directory that holds the input image file.

  3. In the config file, define "preserve_interword_spaces".

  4. After the word preserve_interword_spaces, give either 0 (off) or 1 (on, which keeps the spacing between words). Ex:

preserve_interword_spaces 0

or

preserve_interword_spaces 1

  5. Test & Cheers!!!
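
As a rough illustration of the steps above, the following Python sketch writes the config file and assembles the Tesseract command line. The file names `input.tif` and `output` and the helper names `write_config` and `build_command` are hypothetical, chosen only for this example. It assumes the `tesseract` executable is on the PATH and accepts a config file path as the last argument.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def write_config(directory: Path, preserve: bool = True) -> Path:
    """Write config.txt with the preserve_interword_spaces setting."""
    config_path = directory / "config.txt"
    # Tesseract config files use "name value" pairs, one per line.
    config_path.write_text(f"preserve_interword_spaces {1 if preserve else 0}\n")
    return config_path


def build_command(image: Path, output_base: Path, config: Path) -> List[str]:
    """Assemble: tesseract <image> <output base> <config file>."""
    return ["tesseract", str(image), str(output_base), str(config)]


if __name__ == "__main__":
    image = Path("input.tif")  # hypothetical input image
    # Only run OCR when both the image and the tesseract binary are present.
    if image.exists() and shutil.which("tesseract"):
        config = write_config(image.resolve().parent, preserve=True)
        subprocess.run(build_command(image, Path("output"), config), check=True)
        print(Path("output.txt").read_text())
```

If the config file is not picked up, passing the variable directly with `-c preserve_interword_spaces=1` on the command line is an alternative.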
Pavan Pyati
  • Tested and working in 3.05, but the initial spacing (to the left of the first character) won't be preserved. Some extra processing would be required to insert a bogus character left of each column and then remove it in post-processing. – jbass Jul 03 '17 at 23:08
4

Tesseract compresses consecutive spaces into one. You would need to modify baseapi.cpp to preserve the spaces. The code change can be found in the following posts:

https://groups.google.com/forum/#!searchin/tesseract-ocr/spaces/tesseract-ocr/lGBQiryHcrY/wy5a-L9O3i4J

https://groups.google.com/forum/#!searchin/tesseract-ocr/spaces/tesseract-ocr/9nzPrBZ3118/b3W5GtsFPo0J

Tom Morris
nguyenq
  • I am not a C++ programmer. I could not open the project in VS 2005, and I do not have VS 2008 to compile the changes. How can I get the compiled library? – user2531191 Aug 08 '13 at 01:57
  • You can still get VC++ 2008 Express. http://stackoverflow.com/questions/15318560/visual-c-2008-express-download-link-dead – nguyenq Aug 08 '13 at 03:38
  • @nguyenq: The code for the function `GetUtf8Text` has changed in the `baseapi.cpp` module. Could you explain how to achieve the same in the current version of tesseract-ocr? – aspiring1 Sep 05 '19 at 11:49
  • No need to modify the source code as support for `preserve_interword_spaces` variable was added. – nguyenq Sep 07 '19 at 14:53
  • @nguyenq: But the option `-c preserve_interword_spaces=1` on the command line doesn't help with **right-aligned** text, which gets converted to **left-aligned**, as described in this [answer](https://stackoverflow.com/a/35660676/8030107). Also, tabular data columns end up almost concatenated; I want the columns to look separate and distinct for viewing purposes. – aspiring1 Sep 10 '19 at 13:29
  • @aspiring1 Can you try with various PSM to see if it produces better results? Or put in a ticket with [Tesseract](https://github.com/tesseract-ocr/tesseract/issues) explaining deficiencies in current implementation; they may do something about it. – nguyenq Sep 14 '19 at 14:35
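
One way to act on the suggestion to try various PSM values is to generate one command per page segmentation mode and compare the resulting text files. The sketch below only builds and prints the commands. The helper name `psm_commands`, the input name `input.tif`, and the `out_psm<N>` output names are hypothetical. It assumes the `--psm` spelling used by Tesseract 4 and later (3.x releases use `-psm`), and the chosen modes (3, 4, 6, 11) are just a plausible starting set for tabular text.

```python
from typing import Iterable, List


def psm_commands(image: str, psm_values: Iterable[int]) -> List[List[str]]:
    """Build one tesseract command per page segmentation mode."""
    commands = []
    for psm in psm_values:
        commands.append([
            "tesseract", image, f"out_psm{psm}",  # hypothetical output base name
            "--psm", str(psm),                    # assumption: Tesseract 4+ flag spelling
            "-c", "preserve_interword_spaces=1",
        ])
    return commands


# Print the commands so they can be reviewed or pasted into a shell.
for cmd in psm_commands("input.tif", [3, 4, 6, 11]):
    print(" ".join(cmd))
```

Each run writes its own `out_psm<N>.txt`, so the column spacing produced by the different modes can be compared side by side.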