
Consider an input string provided by the user, containing at least one character from exactly one writing system (e.g. Latin, Cyrillic, Greek, Arabic, Hebrew, Chinese, Japanese, Korean...).

Is it possible to detect which writing system was used? Would I have to go through some Unicode decoding and then Unicode pages, or is there some function which does that for me?

Alexander
  • What about getting the Encoding method? – jdweng Jan 28 '19 at 15:36
  • Your best bet is to pass it to a language detection algorithm. A quick google revealed https://detectlanguage.com/ – Murray Foxcroft Jan 28 '19 at 15:37
  • @jdweng What do you mean? – Alexander Jan 28 '19 at 15:42
  • Can you give an example of input string, I am curious what you call "a char"? And yes, unicode is [solving exactly this problem](https://en.wikipedia.org/wiki/Universal_Character_Set_characters) with formalization of various writing systems into one easy to use encoding. Though I have no clue about rules, naively there are ranges, but I only speak using 2 systems (4 languages), all are [alphabetical](https://en.wikipedia.org/wiki/Writing_system). – Sinatr Jan 28 '19 at 15:48
  • The Unicode character database has a [`Script` property](https://www.unicode.org/reports/tr24/) that exposes this per character -- with some caveats, of course. This is not natively surfaced in .NET, but the UCD is quite accessible. The script data is [here](https://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt). Things like Han unification can complicate things, if what you're really after is more the language than the script. – Jeroen Mostert Jan 28 '19 at 15:57
  • @Sinatr Unicode does not solve my problem. I have different fonts (effectively by just providing the font name to a 3rd party lib, so I cannot try to detect whether the font really has the character I want to render) that are not covering the full Unicode code space, and I now have to select which font to use based on the Unicode characters that should be displayed. – Alexander Jan 28 '19 at 15:59
  • See the remarks on the following MSDN page. A writing system has an associated culture, which includes the encoding method: https://learn.microsoft.com/en-us/dotnet/api/system.globalization.cultureinfo.getcultureinfo?view=netframework-4.7.2 – jdweng Jan 28 '19 at 16:00

2 Answers


You can try using the Google API to detect the language: here

All the credits and usage instructions are here!

Mikev

Try an extension method like this one to check for Unicode characters:

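using System.Linq;

public static class StringExtension
{
    // Returns true if any UTF-16 code unit lies above U+00FF (outside Latin-1).
    // This does not identify the script: Latin letters such as 'ł' (U+0142) match too.
    public static bool IsUnicodeCharacterInIt(this string value)
    {
        return value.Any(c => c > 255);
    }
}

public static class Demo
{
    public static void Check()
    {
        var unicodeString = "سلام بیبی";
        var nonUnicodeString = "hi baby";

        var result1 = unicodeString.IsUnicodeCharacterInIt();    // true
        var result2 = nonUnicodeString.IsUnicodeCharacterInIt(); // false
    }
}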
NaDeR Star
  • So was it *"latin, cyrillic, greek, arabic, hebrew, chinese, japanese, korean..."* or which one? – Sinatr Jan 28 '19 at 15:49
  • This seems to operate under the widespread misconception that all characters that fit in 8 bits are somehow "ASCII" or "not Unicode", which is just not true. It's not even true that all characters that fit in Latin character sets occur there, so I can't think of any circumstance where this check is useful. – Jeroen Mostert Jan 28 '19 at 15:52
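
Following up on the comment above about the Unicode `Script` property: since .NET does not surface it natively, one option is to read the UCD `Scripts.txt` data yourself. The sketch below is only an illustration under stated assumptions: `ScriptTable`, `Parse`, `ScriptOf` and `DetectScript` are hypothetical names, not part of .NET, and the code assumes you have a local copy of `Scripts.txt` in the published `start..end ; Script # comment` format.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper (not part of .NET): maps code points to Unicode script names
// using lines in the Scripts.txt format, e.g. "0041..005A    ; Latin # L&  [26] ...".
public sealed class ScriptTable
{
    private readonly List<(int Start, int End, string Script)> _ranges =
        new List<(int Start, int End, string Script)>();

    public static ScriptTable Parse(IEnumerable<string> lines)
    {
        var table = new ScriptTable();
        foreach (var raw in lines)
        {
            // Strip the trailing "# ..." comment; skip blank and comment-only lines.
            int hash = raw.IndexOf('#');
            string line = (hash >= 0 ? raw.Substring(0, hash) : raw).Trim();
            if (line.Length == 0) continue;

            string[] fields = line.Split(';');
            if (fields.Length < 2) continue;

            // The first field is either a single code point or "start..end", in hex.
            string[] bounds = fields[0].Trim().Split(new[] { ".." }, StringSplitOptions.None);
            int start = int.Parse(bounds[0], NumberStyles.HexNumber, CultureInfo.InvariantCulture);
            int end = bounds.Length > 1
                ? int.Parse(bounds[1], NumberStyles.HexNumber, CultureInfo.InvariantCulture)
                : start;
            table._ranges.Add((start, end, fields[1].Trim()));
        }
        return table;
    }

    // Linear scan for clarity; sort the ranges and binary-search if speed matters.
    public string ScriptOf(int codePoint)
    {
        foreach (var range in _ranges)
        {
            if (codePoint >= range.Start && codePoint <= range.End)
                return range.Script;
        }
        return "Unknown";
    }

    // Returns the script of the first character that has a specific script,
    // or null if the text contains only Common/Inherited/unlisted characters.
    public string DetectScript(string text)
    {
        for (int i = 0; i < text.Length; i++)
        {
            int codePoint = text[i];
            if (char.IsSurrogatePair(text, i))
            {
                codePoint = char.ConvertToUtf32(text, i);
                i++; // skip the low surrogate
            }

            string script = ScriptOf(codePoint);
            if (script != "Common" && script != "Inherited" && script != "Unknown")
                return script;
        }
        return null;
    }
}
```

Usage would look roughly like `var table = ScriptTable.Parse(File.ReadLines("Scripts.txt"));` followed by `table.DetectScript(input)`, with the resulting script name then mapped to a font. Returning the first specific script relies on the question's premise that the input uses exactly one writing system. As noted in the comments, Han unification means the `Han` script alone will not tell Chinese from Japanese text.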