5

I am new to C++ and come from a non-CS background, so kindly excuse me if this question is silly or has been answered before.

I have a string in C++ whose text is in Telugu.

std::string str = "ఉంది"; // (it means exists; pronounced as Vundi)
std::string substring = str.substr(0,3);

The above substring would be "ఉ" (pronounced as Vu), and its Unicode hex value is 0C09.

How can I get the value 0C09 from the substring? The purpose is to check whether the substring is in the valid range for Telugu (0C00–0C7F).

I have seen other questions, but they apply to Obj-C, Java, PHP, C#, etc. I am looking specifically for C++ using std::string.

As per the comment I have read the article at joelonsoftware.com/articles/Unicode.html.

Let me update my question with more information. I am using Fedora 19 x86_64 and encoding is UTF-8. The console is able to display the text properly.

As per the article, if I understand correctly, an ASCII character is a single byte, while a Unicode character encoded as UTF-8 can take several bytes. The above code sample reflects that: each Telugu character here is 3 bytes long. Beyond discussing UTF-8, text encodings, and multibyte characters, the article offers no practical help in detecting the language of a Unicode string.

Maybe I should rephrase my question:

How can I detect the language of a Unicode string in C++?

Thanks in advance for help.

user3014442
  • It looks like you need to learn about text encodings. This is a decent article on the topic: http://www.joelonsoftware.com/articles/Unicode.html Understanding this article will make it *so* much easier to handle the problem you are facing. I recommend it :) – Magnus Hoff Nov 20 '13 at 18:46
  • Thanks for the information and prompt reply. I would go through the article. – user3014442 Nov 20 '13 at 18:51

3 Answers

1

Using std::string, the result that I get is:

std::string str = "ఉంది"; // (it means exists; pronounced as Vundi)
unsigned short i =str[0];
printf("%x %d",i,i);

The output is "ffe0 65504"

But when I use std::wstring, i.e.

std::wstring str = L"ఉంది"; // (it means exists; pronounced as Vundi)
unsigned short i =str[0];
printf("%x %d",i,i);

The output is "c09 3081", which I suppose is the right output. I am not sure, but is that what you want? Let me know.
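The "ffe0" in the std::string case is not the code point. `str[0]` is only the first UTF-8 byte of the character (0xE0), and on platforms where `char` is signed, as is typical on x86_64 Linux, that byte is negative and gets sign-extended when converted to `unsigned short`. A minimal sketch of inspecting the raw bytes without that effect, assuming the string holds UTF-8 (`hexBytes` is a hypothetical helper name, not a standard function):

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: returns the bytes of s as lowercase hex, separated by spaces.
// Each char goes through unsigned char first, so bytes >= 0x80 are not sign-extended.
std::string hexBytes(const std::string& s) {
    std::string out;
    char buf[4];  // two hex digits plus the terminating null fit easily
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned int byte = static_cast<unsigned char>(s[i]);
        std::snprintf(buf, sizeof buf, "%02x", byte);
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

For the three bytes that encode U+0C09, `hexBytes("\xE0\xB0\x89")` gives "e0 b0 89". The code point 0C09 still has to be reassembled from those bytes; the cast only removes the sign extension.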

kunal
0

You could either use ICU or convert UTF-8 to UTF-16/32 by hand by looking at consecutive chars in the string. See here for an explanation of UTF-8 multi-byte chars.

ICU also includes Unicode character properties, which may be helpful, e.g. for detecting scripts.

std::string does not have any built-in support for UTF-8 to UTF-16/32 conversion. It treats its contents as a sequence of bytes, so substr cannot return a Unicode character either, only a range of bytes.
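The by-hand conversion could look roughly like the sketch below. It assumes the std::string holds UTF-8, and the names `decodeUtf8`, `isTelugu` and `allTelugu` are made up for illustration. It checks lead and continuation bytes but does not reject overlong encodings or surrogates, so it is not a full validator.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Decode a UTF-8 std::string into Unicode code points (hypothetical helper).
// Throws std::runtime_error on a bad lead byte, bad continuation byte, or truncation.
std::vector<char32_t> decodeUtf8(const std::string& s) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        if (lead < 0x80)                { cp = lead;        len = 1; }  // 0xxxxxxx
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }  // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }  // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }  // 11110xxx
        else throw std::runtime_error("invalid UTF-8 lead byte");
        if (i + len > s.size()) throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char cont = static_cast<unsigned char>(s[i + k]);
            if ((cont & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (cont & 0x3F);  // append the 6 payload bits
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// True if cp lies in the Telugu block U+0C00..U+0C7F mentioned in the question.
bool isTelugu(char32_t cp) { return cp >= 0x0C00 && cp <= 0x0C7F; }

// True if every code point of the UTF-8 string s is in the Telugu block.
bool allTelugu(const std::string& s) {
    for (char32_t cp : decodeUtf8(s)) {
        if (!isTelugu(cp)) return false;
    }
    return true;
}
```

With the string from the question, and assuming the source file is saved as UTF-8, `decodeUtf8(str.substr(0, 3))[0]` should be 0x0C09 and `allTelugu(str)` should be true. A block range check like this identifies the script, not the language; spaces, digits and punctuation fall outside the block and would need separate handling.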

Neet
  • I agree with you. I was not very keen on using external libraries; sorry, I should have mentioned this. I also do not need any of those special properties or internationalization beyond the hex value of a character. – user3014442 Nov 22 '13 at 07:09
  • As @Neet mentioned, ICU also has "exemplar chars" ( which chars are actually used by the Telugu language ), `UnicodeSet` (for performing the chars-in-range operation) as well as character props. These give you a lot of tools for "detecting what languages a string might be", short of full blown linguistic analysis. ICU was written so that these operations would be available in a consistent cross platform way. One might say "not an external library!" or "it's too big!" but, it takes work to get this right… – Steven R. Loomis Jan 06 '14 at 21:51
0

You need to convert from your encoding (probably UTF-8, stored as char *) to wide characters (wchar_t).

You can see this post or this one for more information about this conversion.

Community
INS