How can I Portably Catch and Handle UTF "EN DASH" Minuses During c++ STL File Reading?

Question

I'm maintaining a large open source project, so I'm running into an odd fringe case on the I/O front.

When my app parses a user parameter file containing a line of text like the following:

CH3 CH2 CH2 CH2     −68.189775    2    180.0              ! TraPPE 1

...at first it looks innocent because it is formatted as desired. But then I see the minus is a Unicode character (−) rather than the ASCII hyphen-minus (-).

I'm just using STL's >> with the ifstream object.

When it attempts to convert to a negative number and fails on the Unicode character, the STL apparently just sets the stream's failure flag (`failbit`), which was triggering my logic that stops the reading process. This is sort of good, as without that logic I would have had an even harder time tracking it down.

But it's definitely not my desired error handling. I want to catch common minus-like characters when reading a double with >>, replace them, and complete the conversion if the string is otherwise a properly formatted negative number.

This appears to be happening to my users relatively frequently as they're copying and pasting from programs (calculator or Excel perhaps in Windows?) to get their file values.

I was somewhat surprised not to find this problem on Stack Overflow, as it seems pretty ubiquitous. I found some reference to this on this question:

c++ error cannot be used as a function, some stray error [closed]

...but that was a slightly different problem, in which the source code itself contained that kind of similar but incompatible "minus-like" EN DASH UTF character.

Does anyone have a good solution (preferably compact, portable, and reusable) for catching such bad minuses when reading doubles or signed integers?

Note:
I don't want to use Boost or C++11 as, believe it or not, some of my users on certain supercomputers don't have access to those libraries. I'm trying to keep it as portable as possible.

Read each line one at a time into a string, apply any fixes, then parse the string (using a `stringstream`, `regex`, or whatever else works). — Alan Stokes, Jan 02 '15 at 10:34
Sure, but that's pretty nonspecific... I already know that I can read it to a `string` and then use replace with the codes for that character... http://www.fileformat.info/info/unicode/char/2013/index.htm ...maybe use a wrapper to wrap ifstream to catch that case? Anyhow, I want to see if anyone has actual code to deal with this... your answer, is appreciated, but it basically is where I'm at. I'm putting this out here in hopes some have dealt with this problem already and have a best practice/portable/compact solution. — Jason R. Mick, Jan 02 '15 at 10:40
The character you pasted into the question isn't u+2013, it's u+2212. You may need to code for multiple possibilities. — Mark Ransom, Jan 02 '15 at 19:03
Also, C++11 is not a library. You could simply statically link the required libraries anyway and ship them with your app. It would not make any difference here though. — Puppy, Jan 02 '15 at 19:04
P.S. If you're looking for others with a similar problem, search for "smart quotes". — Mark Ransom, Jan 02 '15 at 19:04

Oncaphillis · Accepted Answer · 2015-01-02T18:42:20.217

Maybe a custom std::num_get facet is what you need. Other character-to-value conversions can be overridden as well.

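#include <iostream>
#include <locale>
#include <string>
#include <sstream>

// Facet that accepts U+2212 MINUS SIGN as a leading sign when reading a float.
class num_get : public std::num_get<wchar_t>
{
public:
    iter_type do_get( iter_type begin, iter_type end, std::ios_base & str,
                      std::ios_base::iostate & error, float & value ) const
    {
        bool neg=false;
        // Check begin != end before dereferencing; 8722 is U+2212 MINUS SIGN.
        if(begin!=end && *begin==8722) {
            ++begin;
            neg=true;
        }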

        iter_type i = std::num_get<wchar_t>::do_get(begin, end, str, error, value); 

        if (!(error & std::ios_base::failbit)) 
        { 
            if(neg) 
                value=-value; 
        }    
        return i; 
    } 
}; 

int main(int argc,char ** argv) {  

    std::locale new_locale(std::cin.getloc(), new num_get); 

    // Parsing wchar_t streams makes life easier, but in principle
    // it should work with char (e.g. UTF-8) as well

    static const std::wstring ws(L"CH3 CH2 CH2 CH2     −68.189775    2    180.0              ! TraPPE 1"); 
    std::basic_stringstream<wchar_t> wss(ws);                                                                 
    std::wstring a; 
    std::wstring b; 
    std::wstring c; 
    float f=0; 

    // Imbue this new locale into wss 
    wss.imbue(new_locale);                 

    for(int i=0;i<4;i++) { 
        std::wstring s; 
        wss >> s >> std::ws; 
        std::wcerr << s << std::endl; 
    } 

    wss >> f;

    std::wcerr << f << std::endl; 
}
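
The facet above only overrides the float overload and only recognizes U+2212. Since the question asks about doubles, and Mark Ransom's comment suggests coding for several characters, a variant could look roughly like the sketch below. The names `minus_tolerant_num_get`, `is_minus_like`, and `parse_wide_double` are hypothetical, the list of code points is an assumption to be extended as needed, and it assumes the input has already been decoded into a wide string. It avoids C++11 features.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: treats a few minus-like code points as a leading '-'.
class minus_tolerant_num_get : public std::num_get<wchar_t>
{
protected:
    virtual iter_type do_get(iter_type begin, iter_type end, std::ios_base& str,
                             std::ios_base::iostate& err, double& value) const
    {
        bool neg = false;
        if (begin != end && is_minus_like(*begin)) {
            ++begin;
            neg = true;
            // Reject input such as "−-5" or a lone dash instead of
            // silently negating twice or parsing nothing.
            if (begin == end || *begin == L'-' || *begin == L'+') {
                err |= std::ios_base::failbit;
                return begin;
            }
        }
        iter_type it = std::num_get<wchar_t>::do_get(begin, end, str, err, value);
        if (neg && !(err & std::ios_base::failbit))
            value = -value;
        return it;
    }

private:
    // Assumed list: U+2212 MINUS SIGN, U+2013 EN DASH, U+2014 EM DASH.
    static bool is_minus_like(wchar_t c)
    {
        return c == 0x2212 || c == 0x2013 || c == 0x2014;
    }
};

// Hypothetical helper: parse one double from a wide string using the facet.
bool parse_wide_double(const std::wstring& text, double& value)
{
    std::wistringstream wss(text);
    // The locale takes ownership of the facet and deletes it when done.
    wss.imbue(std::locale(wss.getloc(), new minus_tolerant_num_get));
    wss >> value;
    return !wss.fail();
}
```

The same locale can be imbued into any wide input stream before reading, so existing `>>` calls on doubles would not need to change. Other numeric types would each need their own `do_get` override.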

score 1 · Answer 2 · answered Jan 02 '15 at 19:04

Not gonna happen except manually. There are many characters in Unicode, there's an Em Dash as well as an En Dash, and most likely quite a few more. For example, did you consider the possibility of an Em Dash and then a non-breaking-space and then some numbers? Or an RTL override? Unicode is legend because the possibilities are nearly endless, and double-legend in C++ because the Standard support for it could be charitably described as ISIS's support for sanity.

The only real way to do this is to find each situation as your users report it and handle it manually, i.e., do not use operator>> for double directly on the raw input.
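
As a minimal sketch of that manual handling, in the spirit of Alan Stokes's comment (read a line, fix it, then parse it), the following assumes the parameter file is UTF-8 encoded and stays within C++98. The helper names `normalize_minus` and `parse_double` are hypothetical, and the table of dash characters is an assumption that would grow as users report new cases.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: replace UTF-8 encoded minus-like characters
// with the ASCII hyphen-minus. Assumes the input is UTF-8.
std::string normalize_minus(const std::string& line)
{
    static const char* const kDashes[] = {
        "\xE2\x88\x92", // U+2212 MINUS SIGN
        "\xE2\x80\x93", // U+2013 EN DASH
        "\xE2\x80\x94"  // U+2014 EM DASH
    };
    static const std::size_t kCount = sizeof(kDashes) / sizeof(kDashes[0]);

    std::string out(line);
    for (std::size_t i = 0; i < kCount; ++i) {
        const std::string dash(kDashes[i]);
        std::string::size_type pos = 0;
        while ((pos = out.find(dash, pos)) != std::string::npos) {
            out.replace(pos, dash.size(), "-");
            pos += 1; // continue after the inserted '-'
        }
    }
    return out;
}

// Hypothetical helper: parse a double from a token after normalization.
bool parse_double(const std::string& token, double& value)
{
    std::istringstream iss(normalize_minus(token));
    iss >> value;
    return !iss.fail();
}
```

In practice each line from `std::getline` would be passed through `normalize_minus` before being handed to an `istringstream`. Note that this replaces the listed dashes everywhere in the line, including inside trailing comments, which may or may not be acceptable for a given file format.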
