Tesseract - ambiguity in space and tab

Question

I have a TIFF file that contains text separated by tabs (4 spaces). But when I extract the text from this TIFF image, I always get a single space between two columns. An example:

TIFF IMAGE:
col-a    col-b    col-c

desired output:
col-a    col-b    col-c

but I am getting the following:
col-a col-b col-c

I tried this with multiple images of the same format, but the result is always the same. How do I fix this issue? Can I train Tesseract to understand this?

Pavan Pyati · Answer 1 · 2016-06-24T09:47:38.143

9

After a lot of research I found a solution. Here are the steps to follow:

1. Upgrade Tesseract to 3.04.
2. Create config.txt in the directory that contains the input image file.
3. In the config file, define the variable "preserve_interword_spaces".
4. After the word preserve_interword_spaces, give a value of either 0 or 1. A value of 1 preserves the spaces between words; 0 is the default, which collapses them. Ex:

preserve_interword_spaces 0

or

preserve_interword_spaces 1

Test & Cheers!!!
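The steps above can be sketched in Python. This is only an illustration under a few assumptions: the `tesseract` executable is on the PATH and accepts the path to a config file as its last argument, and the names `input.tif`, `output`, `write_config` and `build_command` are placeholders that are not part of the original answer.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def write_config(directory: str, preserve: bool = True) -> Path:
    """Write config.txt containing the preserve_interword_spaces setting."""
    path = Path(directory) / "config.txt"
    path.write_text(f"preserve_interword_spaces {int(preserve)}\n")
    return path


def build_command(image: str, output_base: str, config: Path) -> List[str]:
    """Build the command line: tesseract <image> <output base> <config file>."""
    return ["tesseract", image, output_base, str(config)]


if __name__ == "__main__":
    config = write_config(".")
    # "input.tif" and "output" are placeholder names for this sketch.
    command = build_command("input.tif", "output", config)
    if shutil.which("tesseract") and Path("input.tif").exists():
        subprocess.run(command, check=True)  # Tesseract writes output.txt
    else:
        print("Would run:", " ".join(command))
```

Keeping the command construction in its own function makes it easy to check the arguments without having Tesseract installed.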

edited Jun 24 '16 at 09:47

answered Apr 05 '16 at 13:49

Pavan Pyati



Tested and working in 3.05, but the initial spacing (to the left of the first character) won't be preserved. Some extra processing would be required to insert a bogus character left of each column and then remove it in post-processing. – jbass Jul 03 '17 at 23:08
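The post-processing half of that workaround might look like the sketch below. It assumes that a marker character was added at the left edge of every line in the image before OCR, so that Tesseract keeps the spaces that follow it. The marker `|` and the function name `strip_marker` are hypothetical choices for illustration.

```python
def strip_marker(ocr_text: str, marker: str = "|") -> str:
    """Replace a leading marker on each line with spaces, keeping indentation."""
    cleaned = []
    for line in ocr_text.splitlines():
        if line.startswith(marker):
            # Swap the marker for spaces so the columns keep their positions.
            line = " " * len(marker) + line[len(marker):]
        cleaned.append(line)
    return "\n".join(cleaned)


# Hypothetical OCR output with a marker at the start of each line.
sample = "|    col-a    col-b\n|col-a    col-b"
print(strip_marker(sample))
```

Lines that do not start with the marker are left unchanged, so the function is safe to run on output where OCR missed the marker on some lines.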

score 4 · Accepted Answer · edited Jan 22 '15 at 19:20


Tesseract compresses consecutive spaces into one. You would need to modify baseapi.cpp to preserve the spaces. The code change can be found in the following posts:

https://groups.google.com/forum/#!searchin/tesseract-ocr/spaces/tesseract-ocr/lGBQiryHcrY/wy5a-L9O3i4J

https://groups.google.com/forum/#!searchin/tesseract-ocr/spaces/tesseract-ocr/9nzPrBZ3118/b3W5GtsFPo0J

edited Jan 22 '15 at 19:20

Tom Morris


answered Aug 07 '13 at 23:29

nguyenq


I am not a C++ programmer. I failed to open the project in VS 2005, and I do not have VS 2008 to compile the changes. How can I get the compiled library? – user2531191 Aug 08 '13 at 01:57

You can still get VC++ 2008 Express. http://stackoverflow.com/questions/15318560/visual-c-2008-express-download-link-dead – nguyenq Aug 08 '13 at 03:38
@nguyenq: The code for the function `GetUtf8Text` has changed in the `baseapi.cpp` module. Could you explain how to achieve the same in the current version of tesseract-ocr? – aspiring1 Sep 05 '19 at 11:49
No need to modify the source code as support for `preserve_interword_spaces` variable was added. – nguyenq Sep 07 '19 at 14:53

@nguyenq: But the `-c preserve_interword_spaces=1` option on the _command line_ doesn't help with **right-aligned** text, which gets converted to **left-aligned**, as described in this [answer](https://stackoverflow.com/a/35660676/8030107). Also, tabular data columns end up almost concatenated; I want the columns to look separate and distinct for viewing purposes. – aspiring1 Sep 10 '19 at 13:29
@aspiring1 Can you try with various PSM to see if it produces better results? Or put in a ticket with [Tesseract](https://github.com/tesseract-ocr/tesseract/issues) explaining deficiencies in current implementation; they may do something about it. – nguyenq Sep 14 '19 at 14:35
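One way to compare page segmentation modes is to run the same image through several of them and inspect each output file. The sketch below assumes Tesseract 4 or later, where the option is spelled `--psm` (3.x used `-psm`). The file name `input.tif`, the helper name `build_psm_command`, and the selection of modes are arbitrary choices for illustration.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_psm_command(image: str, output_base: str, psm: int) -> List[str]:
    """Build a Tesseract 4+ command for one page segmentation mode."""
    return [
        "tesseract", image, output_base,
        "--psm", str(psm),
        "-c", "preserve_interword_spaces=1",
    ]


if __name__ == "__main__":
    image = "input.tif"  # placeholder file name
    for psm in (3, 4, 6, 11):  # an arbitrary selection of modes to compare
        command = build_psm_command(image, f"output_psm{psm}", psm)
        if shutil.which("tesseract") and Path(image).exists():
            subprocess.run(command, check=True)  # writes output_psm<N>.txt
        else:
            print("Would run:", " ".join(command))
```

Writing each mode to its own output base keeps the results side by side, which makes it easier to see which mode keeps the columns apart.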
