I have an XML file (in UTF-8). I have to read a value from it into a std::string variable using the pugixml library. After reading the value, I print it to the console, but in my actual project I have to write that value to a PDF (using the LibHaru library). My MWE is as follows:

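#include <iostream>
#include <string>
#include "pugiconfig.hpp"
#include "pugixml.hpp"

using namespace pugi;

int main()
{
    // FILEPATH is a placeholder for the path to the UTF-8 XML file.
    pugi::xml_document doc;
    pugi::xml_parse_result result = doc.load_file(FILEPATH);
    if (!result)
    {
        std::cerr << "Failed to load XML: " << result.description() << std::endl;
        return 1;
    }

    xml_node root_node = doc.child("Report");
    xml_node SystemName_node = root_node.child("SystemName");

    // child_value() returns the text of the node; in the default (char) mode these are UTF-8 bytes.
    std::string strSystemName = SystemName_node.child_value();

    std::cout << " The name of the system is: " << strSystemName << std::endl;

    return 0;
}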

Problem: while debugging, I found that strange characters are read from the XML file (which is already in UTF-8). They appear whether I print the variable on screen or write it to the PDF.

IMPORTANT: Printing to the console is not that important. What matters is writing the value correctly to the PDF file, which also uses UTF-8 encoding. But I think that storing the value in a std::string is somehow causing the problem, and therefore the wrong value is passed to the PDF writer.

PS: I am using VS2010, which does not support C++11.

skm
  • if I change the macro `PUGIXML_WCHAR_MODE`...do I need to build the PugiXML library again? – skm Dec 14 '16 at 09:02
  • I assume so. But I'm now thinking it may not help. The problem, it would seem is not with using `std::string`, but with using `std::cout`s `operator<<` directly. What happens when you use just `SystemName_node.print(std::cout);`? – StoryTeller - Unslander Monica Dec 14 '16 at 09:08
  • If I use it, I still get `├älpha` – skm Dec 14 '16 at 09:10
  • *"But I think that storing the variable in std::string is somehow creating problem"* It doesn't. `std::string` holds a sequence of `char`s. They can be used to encode ASCII or multibyte text, and the string is none the wiser. You'll only notice when you try to use one in an inappropriate place, such as printing UTF-8 to a console that doesn't support UTF-8. If the PDF library expects a UTF-8 encoded char sequence, all will be well. – StoryTeller - Unslander Monica Dec 14 '16 at 09:16

1 Answer


The problem here is that std::cout simply passes the UTF-8 bytes in the string through to the console. On Windows the console normally does not run in UTF-8 but in a legacy code page (for example, code page 1252), so the two bytes of a UTF-8 'ä' (0xC3 0xA4) are displayed as two separate characters.

Your solution is either to switch the console to UTF-8 (see this answer), or to convert your UTF-8 string into a CP-1252 string. I think the conversion is going to require MultiByteToWideChar (specifying UTF-8) followed by WideCharToMultiByte (specifying CP-1252).
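A minimal sketch of that two-step conversion, assuming the Win32 API and plain C++03 so that it builds with VS2010. The helper name `utf8_to_codepage` is made up for illustration and is not part of pugixml or LibHaru. The non-Windows branch is only there so the snippet compiles elsewhere; it returns the input unchanged.

```cpp
#include <string>
#include <vector>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: re-encodes a UTF-8 string into the given Windows
// code page (e.g. 1252) by going through UTF-16.
std::string utf8_to_codepage(const std::string& utf8, unsigned int codepage)
{
#ifdef _WIN32
    if (utf8.empty())
        return std::string();

    // Step 1: UTF-8 -> UTF-16. The first call only asks for the length.
    const int in_len = static_cast<int>(utf8.size());
    const int wide_len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), in_len, NULL, 0);
    if (wide_len <= 0)
        return std::string();
    std::vector<wchar_t> wide(wide_len);
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), in_len, &wide[0], wide_len);

    // Step 2: UTF-16 -> target code page. Characters that do not exist in
    // the target code page are replaced by its default character.
    const int out_len = WideCharToMultiByte(codepage, 0, &wide[0], wide_len,
                                            NULL, 0, NULL, NULL);
    if (out_len <= 0)
        return std::string();
    std::vector<char> out(out_len);
    WideCharToMultiByte(codepage, 0, &wide[0], wide_len, &out[0], out_len, NULL, NULL);
    return std::string(&out[0], out.size());
#else
    // Assumption: outside Windows there is no code page to convert to,
    // so the bytes are passed through untouched.
    (void)codepage;
    return utf8;
#endif
}
```

For console output, something like `std::cout << utf8_to_codepage(strSystemName, GetConsoleOutputCP());` should then match whatever code page the console is actually using. The alternative is to leave the string alone and call `SetConsoleOutputCP(CP_UTF8)` once at startup, although whether the glyphs render still depends on the console font.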

To debug your actual problem (passing UTF-8 strings into pugixml), you need to look at the actual bytes in the strings, and check they are what you think they are.
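One way to look at the actual bytes is a small hex dump. This is only a sketch: `to_hex` is a hypothetical helper, not something pugixml or LibHaru provides, and it is written in C++03 so that it builds with VS2010.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the bytes of `s` as space-separated,
// upper-case hex pairs.
std::string to_hex(const std::string& s)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        const unsigned char byte = static_cast<unsigned char>(s[i]);
        if (i != 0)
            out += ' ';
        out += digits[byte >> 4];
        out += digits[byte & 0x0F];
    }
    return out;
}
```

Printing `to_hex(strSystemName)` right after the call to `child_value()` shows what was really read. If the element contains "älpha" and the file is UTF-8, the dump should start with `C3 A4`. A dump starting with `E4` would mean the text is in a single-byte encoding such as Latin-1 or CP-1252 instead.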

Community
  • Please read the `IMPORTANT` section, where I mentioned that printing on screen is not important. The important thing is to store the value correctly in a `std::string` so that it can be passed properly to the PDF writer. – skm Dec 14 '16 at 09:15
  • So, you need to construct *another* [mcve] that creates a string with a UTF-8 encoding (for example `"\xC3\xA4"`), passes that to the PDF creation function, and shows what output you get (you want 'ä'). If that doesn't work, you need to look at the documentation of the PDF functions and see if you can make them work. If not, you can post *that* example in yet another question. – Martin Bonner supports Monica Dec 14 '16 at 09:25