Character counting in all languages/encodings?

Question

So I recently got an assignment to write the classic "count the occurrences of each type of character" program. That in itself isn't new, but the requirement was that it work with any language/encoding. I feel this is almost impossible, especially since I had only 3 days. For example, accented characters can come out malformed. This is solvable by reading the input with an ISO encoding, which I added as a radio button option, because you cannot detect this while looping over the characters, right? I mean, you could maybe write a regex that somehow matches those cases, I don't know. Then there is the case of some code points being used for multiple or different purposes, etc.
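On the detection point, one option worth noting is that malformed input can often be caught at decode time rather than while looping over characters. The sketch below assumes the input arrives as raw bytes and tries a strict UTF-8 decode first; `StrictDecoding` and `TryDecodeUtf8` are hypothetical names chosen for illustration, and falling back to ISO-8859-1 is only one possible policy, not a general encoding detector.

```csharp
using System;
using System.Text;

public static class StrictDecoding
{
    // Hypothetical helper: returns true and the decoded text only if
    // the bytes are well-formed UTF-8.
    public static bool TryDecodeUtf8(byte[] bytes, out string text)
    {
        // throwOnInvalidBytes: true makes the decoder throw on malformed
        // sequences instead of silently substituting U+FFFD.
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            text = strictUtf8.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            text = string.Empty;
            return false;
        }
    }

    // Assumed policy for illustration: try UTF-8, otherwise treat the
    // bytes as ISO-8859-1, where every byte maps to exactly one character.
    public static string DecodeWithLatin1Fallback(byte[] bytes)
    {
        if (TryDecodeUtf8(bytes, out string text))
        {
            return text;
        }
        return Encoding.GetEncoding("iso-8859-1").GetString(bytes);
    }
}
```

For instance, the bytes `63 61 66 E9` ("café" saved as ISO-8859-1) are not valid UTF-8, so the strict decode fails and the fallback path is taken. Note that a successful decode only shows the bytes are well-formed in that encoding, not that the encoding was the one the author intended.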

Am I missing something here? .NET has no built-in way to handle this, as far as I know. My idea was to read and test every character and build a dictionary of all the missed or malformed ones, then add radio buttons for all the Latin-script languages, etc. I did not have time to finish this. Was the request to make a program that does this for every language just a trick challenge? This was in C#.
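For reference, a minimal sketch of how counting "types of characters" could look once the text is already a correctly decoded .NET string. It walks the string by text element using `StringInfo.GetTextElementEnumerator` and classifies each element by the Unicode category of its first code point. `CharTypeCounter` and `CountByCategory` are hypothetical names, and classifying by the first code point is an assumption about what "type" should mean in the assignment.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class CharTypeCounter
{
    // Counts text elements per Unicode general category.
    public static Dictionary<UnicodeCategory, int> CountByCategory(string input)
    {
        var counts = new Dictionary<UnicodeCategory, int>();

        // A text element keeps a base character together with its
        // combining marks, and keeps surrogate pairs intact.
        TextElementEnumerator elements = StringInfo.GetTextElementEnumerator(input);
        while (elements.MoveNext())
        {
            string element = elements.GetTextElement();

            // The (string, int) overload reads a full code point at the
            // index, so surrogate pairs are classified correctly.
            UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(element, 0);

            counts.TryGetValue(category, out int current);
            counts[category] = current + 1;
        }
        return counts;
    }
}
```

With the input `"Ab1 e\u0301"` this counts one uppercase letter, two lowercase letters (the `e` plus combining acute accent counts once), one decimal digit, and one space separator. How closely text elements match the grapheme clusters of Unicode TR29 depends on the runtime version, so complex emoji sequences may be split differently on older frameworks.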

I believe the duplicate shows how to do what you are looking for. If your input is denormalized, check https://learn.microsoft.com/en-us/dotnet/api/system.string.normalize?view=netframework-4.8 first. — Alexei Levenkov, Feb 23 '20 at 05:23
Not sure if you can still see this since the question is closed, but thank you for the extra reference; this seems like what I was missing. I had never heard of graphemes etc. Excuse me for the double post; I did not think it would be one. — GreatGaja, Feb 23 '20 at 05:37
[Unicode Text Segmentation and Grapheme Clusters](https://unicode.org/reports/tr29/), [Unicode Normalization Forms](http://unicode.org/reports/tr15/) — Jimi, Feb 23 '20 at 05:43
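The normalization and grapheme-cluster points from the comments can be illustrated with a short sketch. It assumes the two spellings of "é" (precomposed U+00E9 versus `e` followed by combining U+0301) as the test case; `NormalizationDemo` and its methods are hypothetical names for illustration.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class NormalizationDemo
{
    // Compares two strings after bringing both to Normalization Form C,
    // so canonically equivalent spellings compare equal.
    public static bool AreCanonicallyEqual(string a, string b)
    {
        return string.Equals(
            a.Normalize(NormalizationForm.FormC),
            b.Normalize(NormalizationForm.FormC),
            StringComparison.Ordinal);
    }

    // Number of text elements, as opposed to string.Length, which
    // counts UTF-16 code units.
    public static int TextElementCount(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }
}
```

Here `"e\u0301".Length` is 2, while `TextElementCount("e\u0301")` is 1, and `AreCanonicallyEqual("\u00E9", "e\u0301")` is true. Normalizing before counting avoids treating the same visible character as two different dictionary keys.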


0 Answers