
For some reason I cannot read data from an XML file properly. For example, instead of "Schrüder" I get something like "SchrÃ¼der".

My code:

tinyxml2::XMLDocument doc;

bool open(string path) {
    if(doc.LoadFile(path.c_str()) == XML_SUCCESS)
        return true;
    return false;
}



int main() {
    if(open("C:\\Users\\Admin\\Desktop\\Test.xml"))
    cout << "Success" << endl;

    XMLNode * node = doc.RootElement();
    string test = node->FirstChild()->GetText();

    cout << test << endl;
    return 0;
}

Part of XML:

<?xml version="1.0" encoding="UTF-8"?>
<myXML>
    <my:TXT_UTF8Test>Schrüder</my:TXT_UTF8Test>
</myXML>

Note that if I convert the file to ANSI and change the declared encoding to "ISO-8859-15", it works fine.

I read that something like "LoadFile( filename, TIXML_ENCODING_UTF8 )" should help. However, that does not work (error: Invalid arguments; it just expects a const char*). I believe I have the latest version of TinyXML2: I downloaded it a couple of minutes ago from https://github.com/leethomason/tinyxml2.

Any ideas?

Edit: When I write the string to a .xml or .txt file, it works fine, so the problem might be with the Eclipse IDE console. However, when I try to send the string via email, I get the same problem. Here's the MailSend function:

bool sendMail(std::string params) {

    if( (int) ShellExecute(NULL, "open", "H:\\MailSend\\MailSend_anhang.exe", params.c_str(), NULL, SW_HIDE) <= 32 )
        return false;
    return true;

}

I call it in the main method like this:

sendMail("-f:d.nitschmann@example.com -t:person2@example.com -s:Subject -b:Body " + test);
FRules
  • While it doesn't answer your question, [pugixml](http://pugixml.org/) claims they have good Unicode support. – Rapptz Jul 16 '13 at 12:12
  • TinyXML too claims to be naturally supporting UTF-8; are you sure the file you are loading is UTF-8 encoded? – nikolas Jul 16 '13 at 12:15
  • Yes, it definitely is. – FRules Jul 16 '13 at 12:33
  • Is it properly UTF-8 encoded? Does it have a BOM? (what are the first raw bytes of the file in question in binary) – Yakk - Adam Nevraumont Jul 16 '13 at 12:34
  • Well, I'm not really sure what you mean with "properly". I opened the XML file with notepad++ and clicked on "Encoding -> UTF8 without BOM" and "Encoding -> Convert to UTF-8 without BOM". Can you tell me where I can see the first raw bytes? Do I need a special editor for that? – FRules Jul 16 '13 at 12:43
  • It would be useful to add the snippet of offending XML in your post. – doron Jul 16 '13 at 13:09
  • Okay, please check again now. Does this help? By the way: if I write it to a .txt or .xml file, it also works fine. Maybe the Eclipse console just can't handle it. If I try to send the string to myself via email, I get the wrong encoding again though. I'm not sure if this has to do with ShellExecute or the mail tool. – FRules Jul 16 '13 at 13:47
  • @Yakk - A BOM is neither required nor recommended for UTF-8, and is altogether pointless for UTF-8, but should nonetheless be handled correctly if it is there. Not sure if you think it should be there or think that it should not; I just thought I should clarify. – Iwan Aucamp Jul 16 '13 at 15:11
  • My theory is that the engine is auto-detecting whether the text is UTF-8 instead of reading the encoding from the header. Stuff a BOM at the start of the file and see if the problem goes away. However, it seems more likely that `cout` isn't expecting UTF-8 encoded `char`? – Yakk - Adam Nevraumont Jul 16 '13 at 15:34

1 Answer

I think the problem is with your terminal. Can you try running your test code in a different terminal, one with known good UTF-8 support?

Output with terminal in UTF-8 mode:

$ ./a.out 
Success
Schrüder

Output with terminal in ISO-8859-15 mode:

$ ./a.out 
Success
SchrÃŒder

Also, please try to follow http://sscce.org/. For posterity's sake, here is your code with everything needed to compile (17676169.cpp):

#include <tinyxml2.h>
#include <string>
#include <iostream>

using namespace std;
using namespace tinyxml2;

tinyxml2::XMLDocument doc;

bool open(string path) {
    if(doc.LoadFile(path.c_str()) == XML_SUCCESS)
        return true;
    return false;
}



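int main() {
    // Stop early if the file cannot be loaded; RootElement() would be null.
    if(!open("Test.xml")) {
        cerr << "Failed to load Test.xml" << endl;
        return 1;
    }
    cout << "Success" << endl;

    XMLNode * node = doc.RootElement();
    // GetText() returns the element's text as the UTF-8 bytes read from the file.
    string test = node->FirstChildElement()->GetText();

    cout << test << endl;
    return 0;
}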

compiled with:

g++ -o 17676169 17676169.cpp -ltinyxml2

and here is the uuencoded Test.xml, to ensure the exact same data is used:

begin 660 Test.xml
M/#]X;6P@=F5R<VEO;CTB,2XP(B!E;F-O9&EN9STB551&+3@B/SX*/&UY6$U,
M/@H@("`@/&UY.E185%]55$8X5&5S=#Y38VARP[QD97(\+VUY.E185%]55$8X
/5&5S=#X*/"]M>5A-3#X*
`
end

Edit 1:

If you want to confirm this theory, run this in Eclipse:

#include <iostream>
#include <string>
#include <fstream>

int main()
{
    std::ifstream ifs("Test.xml");
    std::string xml_data((std::istreambuf_iterator<char>(ifs)), std::istreambuf_iterator<char>());
    std::cout << xml_data;
}

Output with terminal in UTF-8 mode:

$ ./17676169.cat 
<?xml version="1.0" encoding="UTF-8"?>
<myXML>
    <my:TXT_UTF8Test>Schrüder</my:TXT_UTF8Test>
</myXML>

Output with terminal in ISO-8859-15 mode:

$ ./17676169.cat 
<?xml version="1.0" encoding="UTF-8"?>
<myXML>
    <my:TXT_UTF8Test>SchrÃŒder</my:TXT_UTF8Test>
</myXML>
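
A terminal-independent check is also possible: print the bytes of the string as hex, so the console encoding cannot influence what you see. The following is only a minimal sketch; `to_hex` is a hypothetical helper name, not part of TinyXML2 or the standard library. If `test` really holds UTF-8, the "ü" shows up as the two bytes `c3 bc`.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: returns the bytes of `s` as lowercase hex pairs
// separated by single spaces, e.g. "41 42" for "AB".
std::string to_hex(const std::string& s) {
    std::string out;
    char buf[4];
    for (unsigned char c : s) {
        if (!out.empty()) out += ' ';
        std::snprintf(buf, sizeof buf, "%02x", c);  // two hex digits per byte
        out += buf;
    }
    return out;
}
```

Calling `std::cout << to_hex(test) << std::endl;` on the string read from Test.xml should then print `53 63 68 72 c3 bc 64 65 72` for "Schrüder" in UTF-8. In ISO-8859-15 the "ü" would be the single byte `fc` instead.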
Iwan Aucamp
  • Thanks for your answer. I tried it with another terminal - flawless. Unfortunately it doesn't work in the E-Mail though. – FRules Jul 17 '13 at 06:17
  • Can you elaborate on how the email is being sent? – Iwan Aucamp Jul 17 '13 at 07:19
  • Yeah. MailSend.exe expects one string containing four values: From (-f:), To (-t:), Subject (-s:) and Body (-b:). If I write something like "Schrüder" directly in the body, it works. It doesn't work, though, if I insert a string with data from the XML file (see the main method above). – FRules Jul 17 '13 at 08:21
  • Can you also elaborate on how exactly you get the mail to send correctly? I.e. do you run MailSend.exe from a terminal and copy the text into the terminal, or is it also invoked via ShellExecute? If it is in a terminal, which one, and with what encoding? Does the output of your program appear correctly in this terminal? As the above establishes that `string test` does contain UTF-8, my best guess is that MailSend.exe assumes the character encoding of the terminal. Can you try specifying the character set to MailSend.exe explicitly? – Iwan Aucamp Jul 17 '13 at 08:39
  • Also, where is MailSend from and where can I find documentation for it? It doesn't seem to be [this](https://code.google.com/p/mailsend/) – Iwan Aucamp Jul 17 '13 at 08:40
  • There's no documentation for MailSend because it's not published. To be honest (and I probably shouldn't say this), I also have no clue how it works; I haven't seen the code yet. I think the main problem is that the string I get from the XML just doesn't get converted. It seems like I have to write my own function that replaces the characters. – FRules Jul 17 '13 at 09:00
  • Can't you use something more standard instead, like [mailsend](https://code.google.com/p/mailsend/), if it covers your requirements (which it should)? – Iwan Aucamp Jul 17 '13 at 09:22
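
If a hand-written conversion function does turn out to be necessary, as suggested in the comments, it could look roughly like the sketch below. This assumes that MailSend.exe expects a single-byte Latin-1 style encoding, which is only a guess since the tool is undocumented. `utf8_to_latin1` is a hypothetical name; it handles only code points up to U+00FF and replaces everything else with '?'. Note that ISO-8859-15 differs from ISO-8859-1 in eight positions (for example the euro sign), which this sketch does not cover. On Windows, `MultiByteToWideChar` and `WideCharToMultiByte` are the more general route.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: converts UTF-8 to ISO-8859-1 (Latin-1).
// Code points above U+00FF and malformed input become '?'.
std::string utf8_to_latin1(const std::string& in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char c = static_cast<unsigned char>(in[i]);
        if (c < 0x80) {
            // Plain ASCII is identical in both encodings.
            out += static_cast<char>(c);
            ++i;
        } else if ((c == 0xC2 || c == 0xC3) && i + 1 < in.size() &&
                   (static_cast<unsigned char>(in[i + 1]) & 0xC0) == 0x80) {
            // Two-byte sequence encoding U+0080..U+00FF.
            unsigned char c2 = static_cast<unsigned char>(in[i + 1]);
            out += static_cast<char>(((c & 0x03) << 6) | (c2 & 0x3F));
            i += 2;
        } else {
            // Not representable in Latin-1: emit '?' and skip the
            // continuation bytes that belong to this sequence.
            out += '?';
            ++i;
            while (i < in.size() &&
                   (static_cast<unsigned char>(in[i]) & 0xC0) == 0x80) {
                ++i;
            }
        }
    }
    return out;
}
```

With this, the call from the question would become `sendMail("... -b:Body " + utf8_to_latin1(test));`, and the "ü" would be passed as the single byte 0xFC rather than 0xC3 0xBC.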