I have a text file I need to parse sentences from using a list of predefined delimiters.

One of the delimiters is ." and another is ?" to account for sentences ending in quotes.

When I get the value from the .txt file and store it in a string and print it out, it prints fine.

So,

inputFile >> s;
cout << s;

Would yield, say word."

But then when I use this code:

cout << s.substr(s.length()-2);

it prints the literal text \235 to the console.

My delimiter algorithm relies on the value of this substring to be ".\""

Why is this happening? What even is this? This is causing my delimiter to not work, since "\235" != ".\""

Main function:


#include <fstream>
#include <string>
#include <iostream>

using namespace std;

int main(int argc, const char * argv[]) {

    string s;
    ifstream inputFile;
    inputFile.open("PATH_TO_FILE/test.txt");

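    // Loop on the extraction itself; testing eof() first can process the last word twice.
    while (inputFile >> s) {
        cout << "String: " << s << endl;
        // substr() counts bytes, so this is the last two bytes of s (assumes s has at least two).
        cout << "Sub: " << s.substr(s.length() - 2) << endl;
    }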

    return 0;
}

test.txt:

“Of course.”

Output:

String: “Of
Sub: Of
String: course.”
Sub: \200\235
Program ended with exit code: 0
Josh
  • can you add input example and output please? – thisisjaymehta Nov 20 '19 at 06:22
  • Please provide a [mcve]. Either include the file content here, or remove the dependency on the input file (for example by changing it to `s = "word.\""`, if it causes the same behavior) – user202729 Nov 20 '19 at 06:24
  • The code you posted looks fine to me. The error is in the code you did not post. – zdf Nov 20 '19 at 06:32
  • Code added to post – Josh Nov 20 '19 at 06:40
  • Smart quotes. ---- – user202729 Nov 20 '19 at 06:40
  • Are you sure that you posted the test.txt file content correctly? The problem will only happen when the file content is `“Of course.”`, not `"Of course."`. – user202729 Nov 20 '19 at 06:42
  • Fixed it by copying and pasting the txt file into my post. I still don't understand the difference here, or what's going on, or how to fix it. – Josh Nov 20 '19 at 06:45
  • Not sure this is the problem (I'd imagine it'd fail to compile if it was), but usually standard library headers use `<>`, not `""` by standard convention (I'm referring to `"iostream"`, btw). –  Nov 20 '19 at 07:01

1 Answer

Use a hex editor or hex dump to see the difference between the two inputs.

$ xxd <<< '“Of course.”'
00000000: e280 9c4f 6620 636f 7572 7365 2ee2 809d  ...Of course....
00000010: 0a                                       .
$ xxd <<< '"Of course."'
00000000: 224f 6620 636f 7572 7365 2e22 0a         "Of course.".

The character RIGHT DOUBLE QUOTATION MARK (U+201D) is encoded in UTF-8 as the three bytes e2 80 9d; the last two bytes, 0x80 and 0x9d, are 200 and 235 in octal. The program is not printing the text \200\235: it writes those two raw bytes, and the console displays bytes it cannot render as octal escapes.

C++ std::string is not UTF-8 aware: length() and substr() count bytes, not characters, so taking the last two bytes cuts the three-byte quote character in the middle.

If you want to convert the smart quotes to normal quotes in the program, the question "C++ How to replace unusual quotes in code" may help.

user202729
  • Also, I can't find an exact duplicate. I suspect few people have smart quotes in their data anyway (mostly the problem is copying code from some blog and running into invisible whitespace or smart quotes) – user202729 Nov 20 '19 at 06:55