I'm trying to develop a small Windows application to improve my C++ skills outside the MFC framework and to help with my foreign-language studies.
I would like to make a small, personal dictionary that is easy to port and use. While I have no problems developing the GUI, I'm having real trouble saving and restoring data.

My idea is to write a binary file structured as follows:

int (representing the number of words)
int (representing the string length + \0)
sequence of characters zero-terminated.
Now, I'm learning Russian and my primary language is Italian, so I can't use plain old std::string to store words. Moreover (thank you, Microsoft), I'm using VS2010, with all the good and bad that come with it. Here are my routines for writing and reading int and wstring:
//Writing int
void CDizionario::ScriviInt( int nInt, wofstream& file ) const
{
    file.write( reinterpret_cast < const wchar_t * > ( &nInt ), sizeof( nInt ) );
    file.flush();
}
// Writing string
void CDizionario::ScriviWString( int nLStringa, const wstring* pStrStringa, wofstream& file ) const
{
    wchar_t cTerminatore;
    string strStringa;
    file.write( pStrStringa->c_str(), nLStringa );
    file.flush();
    cTerminatore = L'\0';
    file.write( &cTerminatore, sizeof( wchar_t ) );
    file.flush();
}
// Reading int
void CDizionario::LeggiInt( int *pInt, wifstream& file )
{
    file.read( reinterpret_cast < wchar_t * >( pInt ), sizeof( int ) );
}
// Reading wstring
void CDizionario::LeggiWString( int nLStringa, wstring& strStringa, wifstream& file )
{
    wchar_t *pBuf;
    streamsize byteDaLeggere;
    byteDaLeggere = nLStringa;
    pBuf = new wchar_t[(unsigned int)( byteDaLeggere * sizeof( wchar_t ) )];
    file.read( pBuf, byteDaLeggere * sizeof( wchar_t ) );
    strStringa.append( pBuf );
    delete [] pBuf;
}
// Constructor
CDizionario::CDizionario( void )
{
    m_pLoc = new locale( locale::classic(), new codecvt_utf8_utf16< wchar_t > );
}
// Somewhere in my code before calling LeggiInt/ScriviInt/LeggiWString/ScriviWString:
// ...
file.imbue( *m_pLoc );

My first test was ciao - привет, with this result:

01 00 ee bc 90 22 05 00 ee bc 90 22 63 69 61 6f
00 ec b3 8c 07 00 ee bc 90 22 d0 bf d1 80 d0 b8
d0 b2 d0 b5 d1 82 00 ec b3 8c
The numbers are read back correctly; the problem comes when I write strings. I'd expect ciao (63 69 61 6f 00 ec b3 8c) to be written in 10 bytes (two bytes per wchar_t, terminator included) and not in 5. The Russian translation (d0 bf d1 80 d0 b8 d0 b2 d0 b5 d1 82 00 ec b3 8c), on the other hand, does take two bytes per character.
Obviously I'm missing something, but I can't figure out what it is. Can you help me out? Also, if you know a better approach to this problem, I'm open to suggestions.

EDIT: SOLUTION

Following the first of the two methods presented by @JamesKanze, I've decided to sacrifice some portability and let the system do my homework:

void CDizionario::LeggiInt( int *pInt, ifstream& file )
{
    file.read( reinterpret_cast< char * >( pInt ), sizeof( int ) );
}

void CDizionario::LeggiWString( int nLStringa, wstring& strStringa, ifstream& file )
{
    char *pBuf;
    streamsize byteDaLeggere;
    wstring_convert< codecvt_utf8_utf16< wchar_t > > converter;
    byteDaLeggere = nLStringa;
    pBuf = new char[byteDaLeggere];
    file.read( pBuf, byteDaLeggere );
    strStringa = converter.from_bytes( pBuf );
    delete [] pBuf;
}

void CDizionario::ScriviInt( int nInt, ofstream& file ) const
{
    file.write( reinterpret_cast< const char * >( &nInt ), sizeof( nInt ) );
    file.flush();
}

void CDizionario::ScriviWString( const wstring* pStrStringa, ofstream& file ) const
{
    char cTerminatore;
    string strStringa;
    wstring_convert< codecvt_utf8_utf16< wchar_t > > converter;
    strStringa = converter.to_bytes( pStrStringa->c_str() );
    ScriviInt( strStringa.length() + 1, file );
    file.write( strStringa.c_str(), strStringa.length() );
    file.flush();
    cTerminatore = '\0';
    file.write( &cTerminatore, sizeof( char ) );
    file.flush();
}

IssamTP

2 Answers


You've not sufficiently specified the format of the binary file: neither how you represent an int (how many bytes, big-endian or little-endian), nor the encoding and format of the characters. The classical network representation would be a big-endian four-byte (unsigned) integer, and UTF-8. Since this is something you're doing for yourself, you can (and probably should) simplify, using little-endian for integers, and UTF-16LE; these formats correspond to the internal format under Windows. (Note that such code will not be portable, not even to Apple or Linux on the same architecture, and that there is a small chance that the data will become unreadable on a new system.) This is basically what you seem to be attempting, but...

You're trying to write raw binary. The only standard way to do this would be to use std::ofstream (and std::ifstream to read), with the file opened in binary mode and imbued with the "C" locale. For anything else, there will (or may) be some sort of code translation and mapping in the std::filebuf. Given this (and the fact that this way of writing data is not portable to any other system), you may want to just use the system level functions: CreateFile to open, WriteFile and ReadFile to write and read, and CloseHandle to close. (See http://msdn.microsoft.com/en-us/library/windows/desktop/aa364232%28v=vs.85%29.aspx).
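As a rough illustration of the std::ofstream route, the sketch below writes a 32-bit count followed by the raw in-memory bytes of a std::wstring, with the file opened in binary mode and imbued with the "C" locale. The helper names (writeRawWString, readRawWString, saveWord, loadWord) are hypothetical, the length is counted in wchar_t units, and the resulting file is only UTF-16LE on a platform such as Windows where that is the in-memory layout.

```cpp
#include <cstdint>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper: writes a 32-bit length (in wchar_t units) followed by
// the raw in-memory bytes of the string. No terminator, no code conversion.
void writeRawWString( std::ofstream& file, const std::wstring& text )
{
    const std::uint32_t length = static_cast<std::uint32_t>( text.size() );
    file.write( reinterpret_cast<const char*>( &length ), sizeof length );
    file.write( reinterpret_cast<const char*>( text.data() ),
                static_cast<std::streamsize>( text.size() * sizeof( wchar_t ) ) );
}

// Hypothetical helper: reads back what writeRawWString wrote.
// Returns false on a short read. A real reader should also sanity-check
// the length before allocating.
bool readRawWString( std::ifstream& file, std::wstring& text )
{
    std::uint32_t length = 0;
    if ( file.read( reinterpret_cast<char*>( &length ), sizeof length ).fail() ) {
        return false;
    }
    text.resize( length );
    if ( length == 0 ) {
        return true;
    }
    return !file.read( reinterpret_cast<char*>( &text[0] ),
                       static_cast<std::streamsize>( length * sizeof( wchar_t ) ) ).fail();
}

// Hypothetical helper: binary mode plus the classic ("C") locale, so the
// filebuf performs no newline or character-set translation.
bool saveWord( const char* path, const std::wstring& text )
{
    std::ofstream file( path, std::ios::binary | std::ios::trunc );
    file.imbue( std::locale::classic() );
    writeRawWString( file, text );
    return file.good();
}

bool loadWord( const char* path, std::wstring& text )
{
    std::ifstream file( path, std::ios::binary );
    file.imbue( std::locale::classic() );
    return readRawWString( file, text );
}
```

Because the bytes go through a narrow stream untouched, every character occupies sizeof( wchar_t ) bytes on disk, whichever script it belongs to.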

If you want to be portable, on the other hand, I would recommend using the standard network format for the data. Format it into a buffer (std::vector<char>), and write that; at the other end, read into a buffer, and parse that. The read and write routines for an integer (actually an unsigned integer) might be something like:

void
writeUnsignedInt( std::vector<char>& buffer, unsigned int i )
{
    buffer.push_back( (i >> 24) & 0xFF );
    buffer.push_back( (i >> 16) & 0xFF );
    buffer.push_back( (i >>  8) & 0xFF );
    buffer.push_back( (i      ) & 0xFF );
}

unsigned int
readUnsignedInt( 
    std::vector<char>::const_iterator& current,
    std::vector<char>::const_iterator end )
{
    unsigned int retval = 0;
    int shift = 32;
    while ( shift != 0 && current != end ) {
        shift -= 8;
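        // Widen to unsigned int before shifting, so the shift by 24 cannot overflow a signed int.
        retval |= static_cast<unsigned int>( static_cast<unsigned char>( *current ) ) << shift;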
        ++ current;
    }
    if ( shift != 0 ) {
        throw std::runtime_error( "Unexpected end of file" );
    }
    return retval;
}

For the characters, you'll have to convert your std::wstring to a std::string in UTF-8, using one of the many conversion routines available on the network. (The problem is that neither the encoding of std::wstring nor even the size of a wchar_t is standardized. Of the systems I'm familiar with, Windows and AIX use UTF-16, most others UTF-32; in both cases with the byte order dependent on the platform. This makes portable code a bit more difficult.)
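One possible shape for such a conversion routine is sketched below. The function name wideToUtf8 is hypothetical; it assumes the std::wstring holds either UTF-16 (surrogate pairs are combined) or UTF-32 (each element is a code point), and it replaces anything invalid with U+FFFD rather than reporting an error.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: encodes a wide string as UTF-8.
// Works whether wchar_t holds UTF-16 code units or UTF-32 code points.
std::string wideToUtf8( const std::wstring& in )
{
    std::string out;
    for ( std::size_t i = 0; i < in.size(); ++i ) {
        unsigned long cp = static_cast<unsigned long>( in[i] );
        if ( cp >= 0xD800 && cp <= 0xDBFF ) {
            // High surrogate: needs a following low surrogate (UTF-16 only).
            unsigned long low = 0;
            if ( i + 1 < in.size() ) {
                low = static_cast<unsigned long>( in[i + 1] );
            }
            if ( low >= 0xDC00 && low <= 0xDFFF ) {
                cp = 0x10000 + ( ( cp - 0xD800 ) << 10 ) + ( low - 0xDC00 );
                ++i;
            } else {
                cp = 0xFFFD;    // unpaired high surrogate
            }
        } else if ( ( cp >= 0xDC00 && cp <= 0xDFFF ) || cp > 0x10FFFF ) {
            cp = 0xFFFD;        // stray low surrogate or out-of-range value
        }

        if ( cp < 0x80 ) {
            out.push_back( static_cast<char>( cp ) );
        } else if ( cp < 0x800 ) {
            out.push_back( static_cast<char>( 0xC0 | ( cp >> 6 ) ) );
            out.push_back( static_cast<char>( 0x80 | ( cp & 0x3F ) ) );
        } else if ( cp < 0x10000 ) {
            out.push_back( static_cast<char>( 0xE0 | ( cp >> 12 ) ) );
            out.push_back( static_cast<char>( 0x80 | ( ( cp >> 6 ) & 0x3F ) ) );
            out.push_back( static_cast<char>( 0x80 | ( cp & 0x3F ) ) );
        } else {
            out.push_back( static_cast<char>( 0xF0 | ( cp >> 18 ) ) );
            out.push_back( static_cast<char>( 0x80 | ( ( cp >> 12 ) & 0x3F ) ) );
            out.push_back( static_cast<char>( 0x80 | ( ( cp >> 6 ) & 0x3F ) ) );
            out.push_back( static_cast<char>( 0x80 | ( cp & 0x3F ) ) );
        }
    }
    return out;
}
```

With this encoding, ASCII letters such as those in ciao take one byte each and Cyrillic letters take two, which is the pattern visible in the hex dump in the question.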

Overall, I find it easier to just do everything directly in UTF-8, using char. This won't work with the Windows interface, however.

And finally, you don't need the trailing '\0' if you output the length.
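A minimal sketch of a length-prefixed string without a terminator, in the same buffer-based style, might look like this. The names writeString and readString are hypothetical, and the four-byte big-endian length counts bytes of UTF-8, not characters.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: appends a four-byte big-endian byte count,
// then the UTF-8 bytes themselves. No trailing '\0' is written.
void writeString( std::vector<char>& buffer, const std::string& utf8 )
{
    const unsigned long n = static_cast<unsigned long>( utf8.size() );
    buffer.push_back( static_cast<char>( ( n >> 24 ) & 0xFF ) );
    buffer.push_back( static_cast<char>( ( n >> 16 ) & 0xFF ) );
    buffer.push_back( static_cast<char>( ( n >>  8 ) & 0xFF ) );
    buffer.push_back( static_cast<char>( ( n       ) & 0xFF ) );
    buffer.insert( buffer.end(), utf8.begin(), utf8.end() );
}

// Hypothetical helper: reads one string written by writeString and
// advances `current` past it. Throws if the buffer is too short.
std::string readString(
    std::vector<char>::const_iterator& current,
    std::vector<char>::const_iterator end )
{
    if ( end - current < 4 ) {
        throw std::runtime_error( "Unexpected end of file" );
    }
    unsigned long n = 0;
    for ( int i = 0; i < 4; ++i ) {
        n = ( n << 8 ) | static_cast<unsigned char>( *current );
        ++current;
    }
    if ( static_cast<unsigned long>( end - current ) < n ) {
        throw std::runtime_error( "Unexpected end of file" );
    }
    const std::ptrdiff_t count = static_cast<std::ptrdiff_t>( n );
    std::string result( current, current + count );
    current += count;
    return result;
}
```

Since the reader knows exactly how many bytes to take, the terminator carries no information, and strings containing an embedded zero byte survive the round trip as well.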

James Kanze
  • Well, reading this, and given that I'm developing an MFC GUI, I guess that porting is not a concern for now. I'll check your first hint. – IssamTP May 23 '14 at 09:28
  • What about using a small DBMS like SQLite and leaving the dirty work to it? – IssamTP May 23 '14 at 09:38
  • Following your first hint, I let the system decide how the bytes have to be converted. I'll update the question with the solution for those who are interested. – IssamTP May 23 '14 at 12:32

@IssamTP, привет

As mentioned by @James Kanze, working with foreign non-Latin languages inevitably pushes you toward byte-level format conventions and locales. So it may be worth not reinventing the wheel and using an existing technology like XML, which handles these nuances and encodes and decodes non-Latin characters properly.

Yury Schkatula
  • привет @YuriSchkatula, do you have a link to one of these libs at hand? – IssamTP May 23 '14 at 09:40
  • This is definitely the solution I would recommend in a production environment, where portability is important, and the data structures are complex and evolving. For what he's trying to do, it may be overkill, however: interfacing to Xerces, for example, will entail more work than implementing the little bit he needs by hand. – James Kanze May 23 '14 at 09:57
  • @IssamTP, you can count on MS XML http://www.microsoft.com/en-us/download/details.aspx?id=3988 or Xerces or TinyXML, for example. There are various libraries, so it may be worth at least trying them for future experience. – Yury Schkatula May 23 '14 at 12:58
  • @YurySchkatula большое спасибо. Thank you very much, I'll check these out. – IssamTP May 23 '14 at 14:18