
Consider a file containing Unicode words as follows:

آب
آباد
آبادان

If you read right to left, the first character is "آ".

My first requirement is to read the file line by line, which is simple.

The second requirement is to read each line starting from its second character. The result should look like this:

ب
باد
بادان

As you know, there are solutions such as std::string::substr for the second requirement, but AFAIK std::string::substr does not work well with Unicode characters, because it counts bytes rather than characters.
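To illustrate the concern, here is a minimal sketch, assuming the file is UTF-8 encoded. The helper names `sample_line` and `naive_tail` are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <string>

// UTF-8 bytes of the first sample line: U+0622 (D8 A2) followed by U+0628 (D8 A8).
std::string sample_line()
{
    return "\xD8\xA2\xD8\xA8";
}

// Byte-wise substr(1): drops one byte, not one character.
std::string naive_tail(std::string const& s)
{
    return s.empty() ? s : s.substr(1);
}
```

Here `sample_line().size()` is 4 although the line has two characters, and `naive_tail(sample_line())` starts with the continuation byte 0xA2, so the result is no longer valid UTF-8.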

I need something like this:

std::ifstream inFile(file_name);
//Solution for first requirement
std::string line;
if (!std::getline(inFile, line)) {
   std::cout << "failed to read file " << file_name << std::endl;
   inFile.close();
   break;
}
line.erase(line.find_last_not_of("\n\r") + 1);

std::string line2;
// What should go here to meet my second requirement?
// Stay on the current line,
// ignore the first character, and std::getline(inFile, line2)
line2.erase(line2.find_last_not_of("\n\r") + 1);

std::cout << "Line= " << line << std::endl;   // should print آب
std::cout << "Line2= " << line2 << std::endl; // should print ب

inFile.close();
Rezaeimh7
  • C++ currently has no really usable Unicode support. For Unicode handling beyond reading, storing and printing text, use a dedicated library like ICU. – Baum mit Augen Aug 08 '17 at 10:17
  • Your first step is to figure out which encoding your Unicode files use. – Sam Varshavchik Aug 08 '17 at 10:17
  • @SamVarshavchik it uses utf8 – Rezaeimh7 Aug 08 '17 at 10:19
  • Since you know it uses UTF-8, then simply skip over the first character. The Wikipedia article on UTF-8 explains how UTF-8 characters are encoded, and how to determine the multi-byte sequence that defines a single character. That's it. – Sam Varshavchik Aug 08 '17 at 10:51
  • ICU is really your best bet because even if you manage to skip one **code point**, one **character** might be multiple code points. E.g. _é_ can be encoded as `LATIN SMALL LETTER E WITH ACUTE (U+00E9)` or as `LATIN SMALL LETTER E (U+0065)` followed by `COMBINING ACUTE ACCENT (U+0301)` – Khouri Giordano Aug 08 '17 at 16:11
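
The byte-skipping approach suggested in the comments above can be sketched roughly as follows. This is only an illustration, assuming UTF-8 input: the function names are hypothetical, malformed input is tolerated rather than rejected, and (as the last comment points out) one code point is not always one user-perceived character.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <string>
#include <vector>

// Length in bytes of the UTF-8 sequence that starts with `lead`.
// Returns 1 for ASCII and also for invalid lead bytes, so malformed
// input is skipped one byte at a time instead of being rejected.
std::size_t utf8_sequence_length(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 1;                            // stray continuation or invalid byte
}

// Hypothetical helper: returns `line` without its first code point.
std::string drop_first_code_point(std::string const& line)
{
    if (line.empty())
        return line;
    std::size_t n = utf8_sequence_length(static_cast<unsigned char>(line[0]));
    if (n > line.size())
        n = line.size(); // truncated sequence at the end of the input
    assert(n >= 1 && n <= line.size());
    return line.substr(n);
}

// Reads `in` line by line and returns each line minus its first code point.
std::vector<std::string> read_lines_without_first(std::istream& in)
{
    std::vector<std::string> result;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r')
            line.pop_back(); // tolerate Windows line endings
        result.push_back(drop_first_code_point(line));
    }
    return result;
}
```

Passing an open `std::ifstream` to `read_lines_without_first` would cover both requirements from the question in a single pass, since each line is read once and then trimmed in memory.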

1 Answer

C++11 has Unicode conversion routines, but they are not very user-friendly (and note that std::wstring_convert and the <codecvt> facets are deprecated since C++17). You can wrap them in friendlier functions like this:

#include <codecvt>
#include <iostream>
#include <locale>
#include <stdexcept>
#include <string>

// These helpers convert between UTF-8 and the platform's wide character
// encoding (UTF-32 on Linux, UCS-2 on Windows).
std::string ws_to_utf8(std::wstring const& s)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> cnv;
    std::string utf8 = cnv.to_bytes(s);
    if(cnv.converted() < s.size())
        throw std::runtime_error("incomplete conversion");
    return utf8;
}

std::wstring utf8_to_ws(std::string const& utf8)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> cnv;
    std::wstring s = cnv.from_bytes(utf8);
    if(cnv.converted() < utf8.size())
        throw std::runtime_error("incomplete conversion");
    return s;
}

std::string remove_first_char(std::string const& utf8)
{
    std::wstring ws = utf8_to_ws(utf8);
    if(ws.empty())
        return utf8; // nothing to remove; substr(1) would throw std::out_of_range
    ws = ws.substr(1);
    return ws_to_utf8(ws);
}

int main()
{
    // Compiles as C++11/14/17; from C++20 on, u8 literals are char8_t based
    std::string utf8 = u8"آبادان";

    std::cout << remove_first_char(utf8) << '\n';
}

Output:

بادان

By converting to a fixed-width encoding (UCS-2/UTF-32), you can process the string with the normal string functions. There is a caveat, though: UCS-2 does not cover all characters of all languages, so you may have to use std::u32string and write a conversion function between UTF-8 and UTF-32.
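
A minimal sketch of that std::u32string variant, assuming the same C++11 conversion facilities (deprecated since C++17, so newer compilers may warn). The function names `utf8_to_u32`, `u32_to_utf8` and `remove_first_code_point` are hypothetical, not taken from any library.

```cpp
#include <cassert>
#include <codecvt> // deprecated since C++17
#include <locale>
#include <string>

// UTF-8 -> UTF-32; from_bytes throws std::range_error on invalid input.
std::u32string utf8_to_u32(std::string const& utf8)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> cnv;
    return cnv.from_bytes(utf8);
}

// UTF-32 -> UTF-8.
std::string u32_to_utf8(std::u32string const& s)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> cnv;
    return cnv.to_bytes(s);
}

// Drops the first code point; every char32_t holds exactly one code point,
// so substr(1) is safe here regardless of the platform's wchar_t width.
std::string remove_first_code_point(std::string const& utf8)
{
    std::u32string s = utf8_to_u32(utf8);
    if (s.empty())
        return utf8;
    return u32_to_utf8(s.substr(1));
}
```

Because char32_t is 32 bits wide on every platform, this version should behave the same on Windows and Linux, including for code points outside the Basic Multilingual Plane. It still removes a code point rather than a user-perceived character.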

This answer has an example: https://stackoverflow.com/a/43302460/3807729

Galik