I'm getting text files in either Hebrew or Arabic, encoded as UTF-8 with a BOM. I need to convert them to Windows-1255 or Windows-1256, depending on the content.

How can I determine, at runtime, the correct encoding to use?

No luck with Mozilla UniversalDetector, nor with any other solution that I've found. Any ideas? (I need to do it with Java 6. Don't ask why...)

  • Possible duplicate of [What is the most accurate encoding detector?](https://stackoverflow.com/questions/3759356/what-is-the-most-accurate-encoding-detector) – Hans Westerbeek Oct 23 '18 at 12:25
  • Do you mean that despite the UTF-8 BOM the encoding is not UTF-8, or do you mean that it is in UTF-8 and depending on the used scripts they should be converted to either Windows encoding? Or are in either encoding and could be encoded to UTF-8? – Joop Eggen Oct 23 '18 at 12:30
  • @JoopEggen The encoding is UTF-8, but I need to know the correct language (Hebrew or Arabic) in order to convert the files to a new encoding. – Oded Deutch Oct 23 '18 at 12:34

1 Answer

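As of Java 1.7 the Character class knows of Unicode scripts like Arabic and Hebrew. The snippet below additionally needs Java 8 for codePoints() and the lambda; a positive sum means more Arabic than Hebrew code points, a negative sum the opposite.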

int freqs = s.codePoints().map(cp ->
        Character.UnicodeScript.of(cp) == Character.UnicodeScript.ARABIC ? 1
        : Character.UnicodeScript.of(cp) == Character.UnicodeScript.HEBREW ? -1
        : 0).sum();
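
A plain-loop variant of the tally above, as a sketch that should compile on Java 7 because it avoids streams and lambdas. The class name `ScriptBalance` and the method name `scriptBalance` are made up for illustration.

```java
final class ScriptBalance {

    /**
     * Returns a positive number when the text has more Arabic than Hebrew
     * code points, a negative number for the opposite, and 0 on a tie
     * (or when neither script occurs).
     */
    static int scriptBalance(String s) {
        int balance = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp); // step over surrogate pairs correctly
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            if (script == Character.UnicodeScript.ARABIC) {
                ++balance;
            } else if (script == Character.UnicodeScript.HEBREW) {
                --balance;
            }
        }
        return balance;
    }
}
```

Calling `ScriptBalance.scriptBalance(text)` on the decoded file content gives the same kind of signed count as the stream version.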

For Java 1.6 the directionality might be sufficient: Hebrew letters report DIRECTIONALITY_RIGHT_TO_LEFT, while Arabic letters report DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC:

    String s = "אבגדהאבגדהصضطظع"; // First Hebrew, then Arabic.
    int i0 = 0;
    for (int i = 0; i < s.length(); ) {
        int codePoint = s.codePointAt(i);
        i += Character.charCount(codePoint);
        boolean rtl = Character.getDirectionality(codePoint)
                == Character.DIRECTIONALITY_RIGHT_TO_LEFT;
        boolean rtl2 = Character.getDirectionality(codePoint)
                == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
        System.out.printf("[%d - %d] '%s': RTL %s %s%n",
                i0, i, s.substring(i0,  i), rtl, rtl2);
        i0 = i;
    }

[0 - 1] 'א': RTL true false
[1 - 2] 'ב': RTL true false
[2 - 3] 'ג': RTL true false
[3 - 4] 'ד': RTL true false
[4 - 5] 'ה': RTL true false
[5 - 6] 'א': RTL true false
[6 - 7] 'ב': RTL true false
[7 - 8] 'ג': RTL true false
[8 - 9] 'ד': RTL true false
[9 - 10] 'ה': RTL true false
[10 - 11] 'ص': RTL false true
[11 - 12] 'ض': RTL false true
[12 - 13] 'ط': RTL false true
[13 - 14] 'ظ': RTL false true
[14 - 15] 'ع': RTL false true

So

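int arabic(String s) {
    int n = 0;
    // Ordinary Arabic letters are in the BMP, so testing single chars is enough.
    for (char ch : s.toCharArray()) {
        if (Character.getDirectionality(ch)
                == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
            ++n;
            if (n > 1000) { // enough evidence, stop counting
                break;
            }
        }
    }
    return n;
}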
int hebrew(String s) {
    int n = 0;
    // Hebrew letters are in the BMP, so testing single chars is enough.
    for (char ch : s.toCharArray()) {
        if (Character.getDirectionality(ch)
                == Character.DIRECTIONALITY_RIGHT_TO_LEFT) {
            ++n;
            if (n > 1000) { // enough evidence, stop counting
                break;
            }
        }
    }
    return n;
}

if (arabic(s) > 0) {
    return "Windows-1256"; // any Arabic letter at all selects Arabic
} else if (hebrew(s) > 0) {
    return "Windows-1255";
} else {
    return "Klingon-1257"; // placeholder: neither Hebrew nor Arabic letters found
}
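
Putting the pieces together, here is a hedged sketch of the whole conversion for bytes that are already in memory, written in Java 6 style. The class `BomTextConverter` and its methods are hypothetical names. Unlike the snippet above, it picks the script with more letters rather than preferring Arabic outright (ties go to Arabic); that majority rule is an assumption and may need adjusting for real files.

```java
import java.nio.charset.Charset;

final class BomTextConverter {

    private static final Charset UTF8 = Charset.forName("UTF-8");

    /** Returns "windows-1256", "windows-1255", or null if neither script occurs. */
    static String chooseCharset(String text) {
        int arabic = 0;
        int hebrew = 0;
        for (int i = 0; i < text.length(); i++) {
            byte dir = Character.getDirectionality(text.charAt(i));
            if (dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                arabic++;
            } else if (dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT) {
                hebrew++;
            }
        }
        if (arabic == 0 && hebrew == 0) {
            return null;
        }
        // Assumption: majority wins, ties go to Arabic.
        return arabic >= hebrew ? "windows-1256" : "windows-1255";
    }

    /** Decodes UTF-8 bytes and drops a leading BOM (U+FEFF) if present. */
    static String decodeUtf8(byte[] bytes) {
        String text = new String(bytes, UTF8);
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    /** Re-encodes UTF-8 input into the Windows code page chosen from its content. */
    static byte[] convert(byte[] utf8Bytes) {
        String text = decodeUtf8(utf8Bytes);
        String charsetName = chooseCharset(text);
        if (charsetName == null) {
            throw new IllegalArgumentException("No Hebrew or Arabic letters found");
        }
        return text.getBytes(Charset.forName(charsetName));
    }
}
```

Reading the file into a byte array and writing the result back out is left to the caller. Characters that have no mapping in the chosen code page are replaced by the encoder's default substitute, so mixed-script files will lose data.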
Joop Eggen