I'm making an interpreter for my own programming language as a hobby. My problem is that non-ASCII characters are displayed incorrectly in the Windows CMD. The source file I'm reading is saved as UTF-8; I presume it's UTF-8 without a BOM. When my source file says, for example:

print "á"

On my Mac I get the expected output, the letter á, but on my PC I get ├í. I thought it was a code-page problem, but the code page I'm using has the letter á. Then I tried a different font: Lucida Grande works. But in the Python interpreter the letter á is displayed correctly in the default font.

I asked people on Stack Overflow and someone said my program itself was probably compiled with the wrong encoding. So my question is: how can I specify or change the encoding that is used when the C++ compiler compiles my files? I'm using TDM-GCC as my compiler; I've also used MinGW and had the same problem.

Thanks for your help

---EDIT---

Below is my entire source file. You can compile it like this:

c++ myfile.cc -o myprogram -std=c++11

Whenever I run "myprogram.exe somefile.mylang", where somefile.mylang says:

print "Hello á"

I get this output on the windows CMD:

"Hello á"

I don't know how Python, Lua, Ruby, etc. can use the default console font and output the correct character.

#include <iostream>
#include <string>
#include <fstream>
#include <sstream>
#include <vector>
#include <cstdlib>

using namespace std;

/* Global Variables */
/* Not all of these are actual "keywords" that can be used in programs.
   They are called keywords because they are reserved, either because they
   are specified as keywords in the grammar or because they are reserved by
   the interpreter for internal use. */
string keywords[9] = { "print", "string", "sc", "variable", "eq", "undefined", "nl", "num", "expr" };
/* We store tokens in a vector; we could use an array, but specifying an array's
   size at runtime is technically impossible and the workarounds are a pain. */
vector<string> tokens;

/* Our "symbol table" is just a vector too because, we can only determine how
   large the symbol table should be at runtime, so we use a vector to make things
   easier. */
vector<string> variables;

/* Function Declarations */
/* We declare all of the functions up here because it makes it easy to see how many
   functions we have and it makes it easier to find inefficiencies, also it makes the
   code look nicer. */
void exec_program(string filename);
string load_program(string filename);
void lex(string prog);
void parse(vector<string> tokens);

void doPRINT(string toPrint);
void doASSIGN();
void goGETVAR();

/* Definitions */
/* These are our constants, these are defined as constant at the start of the program so
   that if anything goes wrong in the execution of the code we can always display the
   right kind of errors. */
#define IO_ERROR "[IO ERROR] "
#define SYNTAX_ERROR "[SYNTAX ERROR] "
#define ASSIGN_ERROR "[ASSIGN ERROR] "

/* We load the program into the interpreter by reading the file */
string load_program(string filename) {

    string filedata;

    ifstream rdfile(filename);
    /* We check to see whether or not we can open the file. This doesn't tell us whether
       the file exists because permissions could also prevent us being able to open the file. */
    if (!rdfile) {
        cerr << IO_ERROR << "Unable to open the file \"" << filename << "\"." << endl;
        exit(EXIT_FAILURE);
    }
    /* Loop through and grab each line of the file, then store each line in filedata. */
    for (std::string line; std::getline(rdfile, line); )
    {
        filedata += line;
        filedata += "\n";
    }

    /* Close the file when we're done. */
    rdfile.close();

    /* Return the data so that the rest of the program can use it. */
    return filedata;                       
}

void lex(string prog) {
    string toks = "";
    string n = "";
    string expr = "";
    bool state = 0;   /* 1 while we are inside a string literal */
    bool isexpr = 0;
    string s = "";

    for (size_t i = 0; i < prog.size(); ++i) {
        toks += prog[i];
        if (toks == " " and state == 0) {
            toks = "";
            if (n != "") {
                //isexpr = 1;
                //tokens.push_back(keywords[7] + ":" + n);
            }
            n = "";
        } else if (toks == ";" and state == 0) {
            toks = "";
            if (expr != "" and isexpr == 1) {
                tokens.push_back(keywords[8] + ":[" + expr + "]");
            } else if (n != "" and isexpr == 0) {
                tokens.push_back(keywords[7] + ":" + expr);
            }
            /* Calling back() on an empty vector is undefined, so check first. */
            if (!tokens.empty() and tokens.back() != "sc") {
                tokens.push_back(keywords[2]);
            }
            n = "";
            expr = "";
            isexpr = 0;
        } else if (toks == "\n" and state == 0) {
            toks = "";
            if (expr != "" and isexpr == 1) {
                tokens.push_back(keywords[8] + ":[" + expr + "]");
            } else if (n != "" and isexpr == 0) {
                tokens.push_back(keywords[7] + ":" + expr);
            }
            /* Calling back() on an empty vector is undefined, so check first. */
            if (!tokens.empty() and tokens.back() != "sc") {
                tokens.push_back(keywords[2]);
            }
            n = "";
            expr = "";
            isexpr = 0;
        } else if (toks.size() == 1 and toks[0] >= '0' and toks[0] <= '9') {
            /* A single digit: part of a number, or of a string literal. */
            if (state == 0) {
                n += toks;
                expr += toks;
            } else {
                s += toks;
            }
            toks = "";
        } else if ((toks == "+" or toks == "-" or toks == "*" or toks == "/") and state == 0) {
            /* Operators only start an expression outside string literals. */
            expr += toks;
            isexpr = 1;
            toks = "";
            n = "";
        } else if (toks == keywords[0]) {
            tokens.push_back(keywords[0]);
            toks = "";
        } else if (toks == "\"") {
            if (state == 0) {
                state = 1;
            } else if (state == 1) {
                state = 0;
                tokens.push_back(keywords[1] + ":" + s + "\"");
                s = "";
                toks = "";
            }
        } else if (state == 1) {
            s += toks;
            toks = "";
        }
    }
    size_t ii = 0;
    while (ii < tokens.size()) {
        //cout << tokens[ii] << endl;
        ii++;
    }
}

string evalExpression(string expr) {
  int res = 0;
  int getnextnum = 0;
  int iter = 0;
  int it = 0;
  string opp = "";
  string num = "";
  string num1 = "";
  string num2 = "";
  string result = "";
  vector<string> numholder;

  for (char & c : expr) {
    if (c == '0' or c == '1' or c == '2' or c == '3' or c == '4' or c == '5' or
      c == '6' or c == '7' or c == '8' or c == '9') {
      // c is a number
      num += c;

    } else if (c == '+' or c == '-' or c == '*' or c == '/') {
      // c is an operator
      numholder.push_back(num);
      if (c == '+') {
        opp = "+";
      } else if (c == '-') {
        opp = "-";
      } else if (c == '*') {
        opp = "*";
      } else if (c == '/') {
        opp = "/";
      }
      numholder.push_back(opp);
      num = "";

    } else if (c == ']') {
      // end of expression
      numholder.push_back(num);

    } else if (c == '(' or c == ')') {
      // c is a round bracket

    }
  }

  for ( iter = 0; iter < numholder.size(); ++iter) {
    if (numholder[iter][0] == '+' or numholder[iter][0] == '-' or numholder[iter][0] == '*' or numholder[iter][0] == '/') {
      iter++;
    }
    if (numholder[iter][0] >= '0' and numholder[iter][0] <= '9') {
      // num = NUMBER
      if (num1 == "") {
        num1 = numholder[iter];
      }
      else if (num2 == "") {
        num2 = numholder[iter];
      }
    }

    if (iter-1 >= 0) {
        it = iter - 1;
        //cout << numholder[iter] << "    " << numholder[iter-1] << "    num1 = " << num1 << "    num2 = " << num2 << endl;

        if (numholder[it][0] == '+' and num1 != "" and num2 != "") {
          res = stoi(num1) + stoi(num2);
          num1 = to_string(res);
          num2 = "";
        } else if (numholder[it][0] == '-' and num1 != "" and num2 != "") {
          res = stoi(num1) - stoi(num2);
          num1 = to_string(res);
          num2 = "";
        } else if (numholder[it][0] == '*' and num1 != "" and num2 != "") {
          res = stoi(num1) * stoi(num2);
          num1 = to_string(res);
          num2 = "";
        } else if (numholder[it][0] == '/' and num1 != "" and num2 != "") {
          res = stoi(num1) / stoi(num2);
          num1 = to_string(res);
          num2 = "";
        }
    }
    //iter++;
  }
  numholder.clear();
  num1 = "";
  num2 = "";
  num = "";
  //cout << res << endl;
  expr = to_string(res);

  return expr;
}

void doPRINT(string toPrint) {
  if (toPrint.substr(0,6) == "string") {
    toPrint = toPrint.substr (7);
    toPrint = toPrint.substr(1,toPrint.size() - 2);
  } else if (toPrint.substr(0,3) == "num") {
    toPrint = toPrint.substr (4);
  } else if (toPrint.substr(0,4) == "expr") {
    toPrint = toPrint.substr (6);
    toPrint = evalExpression(toPrint);
  }
  cout << toPrint << endl;
}

void parse(vector<string> tokens) {
    size_t i = 0;
    while (i < tokens.size()) {
        /* A print statement needs three tokens: "print", a value and "sc". */
        if (i + 2 < tokens.size() and tokens[i] == "print" and tokens[i+2] == "sc" and
            (tokens[i+1].substr(0,6) == "string" or
             tokens[i+1].substr(0,3) == "num" or
             tokens[i+1].substr(0,4) == "expr")) {
            doPRINT(tokens[i+1]);
            i += 3;
        } else {
            /* Skip tokens we don't recognise so the loop always terminates. */
            ++i;
        }
    }
}

/* Main program exec function */
void exec_program(string filename) {
    lex(load_program(filename));
    parse(tokens);
}

/* The main function, we have to start somewhere. */
int main(int argc, char* argv[]) {

    if (argc < 2) {
        cout << "Usage: reedoo <filename> [args]" << endl;
    } else {
        exec_program(argv[1]);
    }
    return 0;
}
Francis
  • 115
  • 5
  • 2
    I am not sure I understand the question. In any case, it's not C++ but the Windows console that has to be told about the character encoding. Either [Output unicode strings in Windows console app](http://stackoverflow.com/q/2492077/341970) or [How to make Unicode charset in cmd.exe by default?](http://stackoverflow.com/q/14109024/341970) may help. – Ali Jul 01 '14 at 22:06
  • 1
    Redirect the output to a file and open it with a hex editor. If it's the expected byte sequence, blame your terminal. If not, show code. – PlasmaHH Jul 01 '14 at 22:06
  • @Ali I added my source code. – Francis Jul 01 '14 at 22:25
  • @PlasmaHH I added my code. – Francis Jul 01 '14 at 22:26
  • OK, let's try to narrow down the problem, and let's check if I understand it correctly. Please [run this code](https://gist.github.com/baharev/731f3c6c8d15320d2336) and please put a file named data.txt in the appropriate directory. Put an `á` in this data.txt file. Does this program print `├í` on Windows? – Ali Jul 01 '14 at 23:16
  • @Ali yep. It prints the weird characters instead of the á. – Francis Jul 01 '14 at 23:33
  • OK, so this means that we narrowed down the problem to those few lines of code. Good. Now, what do you get with my program, if you follow the advice of those answers I link to in my first comment? Please try the `cmd /K chcp 65001` first; that's easy. (I am on Linux). – Ali Jul 01 '14 at 23:37
  • @Ali I apologise but I've had to leave my computer, I'm on my phone at the moment so I can't try that command. I'll run that command as soon as I can and I'll let you know the result. Thanks for your help so far! – Francis Jul 01 '14 at 23:41
  • No problem. However, keep in mind that I am going to bed now; it's late night here. – Ali Jul 01 '14 at 23:43
  • @Ali I changed the code page to 65001 like you said and it still gives me those weird characters instead of the á. – Francis Jul 02 '14 at 11:09
  • OK, I am looking at the problem on Windows. It seems like a stubborn problem and there are many bad answers here on Stackoverflow. Please give me some time. – Ali Jul 02 '14 at 12:15
  • Which version of Windows are you using / want to support? XP, 7, 8? – Ali Jul 02 '14 at 12:43
  • @Ali I'm using Windows 7 and I've tested it on Windows 8 and got the same result. Ideally I'd like to target as many versions of Windows as possible. – Francis Jul 02 '14 at 13:24
  • OK, the good news is that I can reproduce the problem, NONE of the answers here at Stackoverflow works for me. The bad news is that it will take a while to figure out a solution. Please be patient. – Ali Jul 02 '14 at 14:05
  • I have been struggling with this for hours and I give up. I am very sorry. The fact is that **none** of the suggested solutions that can be found here at Stackoverflow worked for me. The only thing that I haven't tried is [this suggested solution](http://stackoverflow.com/a/9051543/341970) with the Microsoft compiler. That should work, if you set the console to unicode (`chcp 65001`) *and* use Lucida fonts. The problem with the Mingw variants is that you need at least the 8.0 version of the CRT and I failed to link against it with mingw even though I tried very hard. Sorry. – Ali Jul 02 '14 at 18:36
  • @Ali Thanks for trying. I really appreciate that you tried for so long. :D – Francis Jul 02 '14 at 19:41
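
The hex-editor check PlasmaHH suggests in the comments can also be done in code. This is only a minimal sketch: `hex_dump` is a hypothetical helper, not part of the posted interpreter. If the file really is UTF-8, `á` should show up as the two bytes `c3 a1`. A console using an OEM code page such as 437 or 850 draws those same two bytes as `├` and `í`, which matches the output described in the question.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not in the original program): render every byte of
// `bytes` as two lowercase hex digits, separated by single spaces.
std::string hex_dump(const std::string& bytes) {
    std::string out;
    char buf[3];
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        unsigned value = static_cast<unsigned char>(bytes[i]);
        std::snprintf(buf, sizeof buf, "%02x", value);
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

Printing `hex_dump(filedata)` on what `load_program` returns would show whether the bytes are already wrong when read, or only look wrong once the console displays them.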

1 Answer

It is not about how you compile myprogram.exe, it is about what myprogram.exe does with somefile.mylang.

It is your duty as a language developer to say "the source files for programs in the mylang script should be UTF-8", or to recognize a code page tag inside the source file. And you should also say "the strings in the mylang language are encoded as UTF-foo" (because that affects operations like "hello".charAt(3) or whatever equivalent method you have there).

And then it is the duty of your compiler/interpreter (myprogram.exe) to open the source (somefile.mylang) with the proper encoding, and convert it to UTF-foo for the internal representation.
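
As a rough illustration of that last step, here is a minimal sketch that assumes the language declares its source files to be UTF-8 and picks Unicode code points as the internal representation. Both choices, and the name `decode_utf8`, are assumptions made for the example; nothing in the question fixes them. Malformed input becomes U+FFFD, and the sketch does not reject overlong encodings or surrogates.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: decode UTF-8 bytes into Unicode code points.
// Stray, malformed or truncated sequences are replaced by U+FFFD.
std::vector<std::uint32_t> decode_utf8(const std::string& in) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFDu); ++i; continue; }  // stray byte

        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= in.size()) { ok = false; break; }       // truncated
            unsigned char c = static_cast<unsigned char>(in[i]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }   // not 10xxxxxx
            cp = (cp << 6) | (c & 0x3Fu);
            ++i;
        }
        out.push_back(ok ? cp : 0xFFFDu);
    }
    return out;
}
```

With a representation like this, an operation such as charAt(3) works on characters rather than on raw bytes, so `á` counts as one character even though it takes two bytes in the file.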

Mihai Nita
  • 5,547
  • 27
  • 27