Android Tess-Two OCR unmappable character 'ﬁ'

Question

My Android app has OCR functionality using the tess-two library. I have an issue reading strings that contain "fi". After baseApi.getUTF8Text(), the method that returns the text recognized by the OCR, the "fi" in the returned String is "ﬁ". Look very closely at that string: it is not two characters but a single character. You can reproduce that by copying and pasting it. I am thinking it might be a UTF-8 encoding issue or something similar, which I don't know enough about. When I tried string.replace("ﬁ","fi"), the Android Studio build failed with the error "unmappable character for encoding utf-8". I tried searching on Google, but it treats the character as a regular "fi", not "ﬁ".

Is there any way I can fix this character?

score 6 · Accepted Answer · edited May 23 '17 at 12:29

You can avoid recognizing the ﬁ ligature by blacklisting it before calling baseApi.setImage:

baseApi.setVariable(TessBaseAPI.VAR_CHAR_BLACKLIST, "ﬁ");

To prevent Android Studio from reporting the unmappable character error in your Java code, convert the file's encoding to UTF-8 by choosing "UTF-8" from the encoding selector near the bottom right corner of the Android Studio window.
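
If changing the file encoding is not convenient, another option is to write the ligature as the Unicode escape "\uFB01", so the source file stays plain ASCII and compiles the same way under any encoding. The following is a minimal sketch of that idea; the class and method names are hypothetical and are not part of tess-two, and the input string is only a stand-in for real OCR output.

```java
// Hypothetical helper; the names are illustrative and not part of tess-two.
final class LigatureEscapes {

    // U+FB01 LATIN SMALL LIGATURE FI, written as an escape so the source stays ASCII-only
    static final String FI_LIGATURE = "\uFB01";

    // Replaces every fi ligature with the two separate letters "fi".
    static String expandFi(String text) {
        return text.replace(FI_LIGATURE, "fi");
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by the OCR engine
        String recognized = "\uFB01nal con\uFB01g";
        System.out.println(expandFi(recognized)); // prints "final config"
    }
}
```

The same constant could presumably be passed as the value in the blacklist call above instead of the literal character, which would avoid the encoding error in that line as well.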


answered Sep 05 '15 at 02:34

rmtheis


So far this is fine :) I knew how the blacklist works, but I never considered putting that character there because I thought it'll be a question mark when built. – Sheychan Sep 07 '15 at 01:15

score 2 · Answer 2 · answered Sep 03 '15 at 04:00

Here's what I found, FWIW: the character 'ﬁ' is a ligature (more at: Unicode Character 'LATIN SMALL LIGATURE FI' (U+FB01))

Here's a quick and dirty program to find and replace 'ﬁ' with any other characters:

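public class LigatureFI
{

    // U+FB01 is LATIN SMALL LIGATURE FI
    static final char LIGATURE_FI = 0xFB01;

    public static void main(String[] args)
    {
        String ligature = String.valueOf(LIGATURE_FI);
        // The literal below requires the source file to be saved and compiled as UTF-8
        String string = "ﬁﬁﬁﬁﬁﬁﬁﬁﬁﬁﬁﬁﬁﬁﬁ";
        System.out.println(string);
        // replace() does a literal replacement; no regex is needed here
        string = string.replace(ligature, "FI");
        System.out.println(string);
    }

}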

If your IDE complains about 'ﬁ' not being in the cp1252 charset, save as UTF8.
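
When it is unclear whether a string really holds the ligature or a garbled substitute, printing its code points can help. The sketch below also shows a different technique from the replace call above: Unicode NFKC normalization via java.text.Normalizer, which expands compatibility characters such as U+FB01 into plain letters. The class and method names are hypothetical. Note that NFKC also rewrites other compatibility characters (full-width forms, superscripts and so on), and that String.codePoints() assumes Java 8 or later.

```java
import java.text.Normalizer;

// Hypothetical helper; the names are illustrative only.
final class LigatureNormalizer {

    // Expands compatibility characters such as the fi ligature using NFKC normalization.
    static String expandLigatures(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    // Renders each code point as U+XXXX so an encoding mix-up becomes visible.
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "\uFB01x";                   // ligature followed by the letter x
        System.out.println(describe(s));        // prints "U+FB01 U+0078"
        System.out.println(expandLigatures(s)); // prints "fix"
    }
}
```

If describe() reports something other than U+FB01 for a string literal typed into the source, the file was most likely compiled with a different encoding than the one it was saved in.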

HTH.

answered Sep 03 '15 at 04:00

user5292387


Your method doesn't work, the result is a string of question marks. – Zarwan Sep 03 '15 at 04:05
I think this is happening because `ﬁ` is not a known character. I'm assuming your replace function is not working, so the `ﬁ` is still there, and since IntelliJ can't output it properly it's replacing it with a question mark. – Zarwan Sep 03 '15 at 04:16
Method works on my machine, result is "FIFIFIFIFIFIFIFIFIFIFIFIFIFIFI" – user5292387 Sep 03 '15 at 04:22
That is strange. I tried it with '\uFB01' instead, which is the proper way to refer to it in Java and it still didn't work. It's weird because if I copy and paste that in IntelliJ the paste will give the "ﬁ" character, not the coding, so I know that part is right. When I tried `ﬁ == '\uFB01'` it also gave me `true`, but when I tried `string.charAt(0) == '\uFB01'` it gave me false, even though I copied the same character "ﬁ" to make the string. I'm not sure what's going on. – Zarwan Sep 03 '15 at 04:27
