
I have been trying to fix this problem for days and I can't get it. Basically my code is supposed to read a .csv file produced by wmic and save it to a struct. I can read the data and it is being stored, but the data has an extra space after each character. I have tried switching to the Unicode versions of the functions and using wide strings but they only messed up the data even more (they turned a "n" into a "ÿ").

Here is the code that I think is the issue:

system("wmic product get name,version,installdate,vendor /format:csv > product.txt");

std::ifstream infoFile("./product.txt"); // The file wmic wrote in csv format.

if(infoFile.is_open())
{
    std::string line;
    int lineNum = 0;

    while(getline(infoFile, line))
    {
        lineNum++;
        std::cout << "\nLine #" << lineNum << ":" << std::endl;

        Program temp;
        std::istringstream lineStream(line);
        std::string cell;
        int counter = 0;
        int cellNum = 0;

        while(getline(lineStream, cell, ','))
        {
            cellNum++;
            std::cout << "\nCell #" << cellNum << ":" << cell << std::endl;

            switch(counter)
            {
            case 0:
                break;
            case 1:
                temp.installDate = cell;
                break;
            case 2:
                temp.name = cell;
                break;
            case 3:
                temp.vendor = cell;
                break;
            case 4:
                temp.version = cell;
                break;
            default:
                std::cout << "GetProductInfo(): Invalid switch value: " << counter << std::endl;
                break;
            }
            counter++;
        }

        information->push_back(temp); // Vector to save all of the programs.
    }

    infoFile.close();
}
else
{
    std::cout << "GetProductInfo(): Failed to open the input file." << std::endl;
    return 1;
}

return 0;
}

Edit: O.K., I am trying to write the BOM (FF FE 0D 00 0A) as it wasn't being written before. I am writing a char array with the hex values but there is an extra 0x0D being added (FF FE 0D 00 0D 0A). It is also saving internal variables with the extra spaces. That might not be an issue as I can modify my code to account for it, but that wouldn't be optimal. Any ideas?

Edit2: So I guess I don't need the BOM. My main problem now is just reading the UTF-16LE file and saving the data to a struct without the extra spaces. I need some help doing it the right way as I would like to figure out how to prevent this in the future. Thanks for your help everyone, this bug is critical.

IWillByte

4 Answers


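This smelled a lot like a text encoding problem, so I went ahead and tried running the command you provided, and sure enough, the output file is encoded in UTF-16LE. (That's 16-bit code units, little-endian.) For plain ASCII text, every second byte in that encoding is 0x00, which is what shows up as the "extra space" after each character. Try opening the file in a hex editor to see what it actually looks like.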
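If you don't have a hex editor handy, the same check can be done in a few lines of code. This is only a rough sketch: `detectBom` and `readPrefix` are names I made up for illustration, and it deliberately ignores UTF-32 (whose little-endian BOM also starts with FF FE).

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Returns a label for the byte order mark at the start of `bytes`, or "none".
// Note: FF FE 00 00 is also the UTF-32LE BOM; this sketch ignores UTF-32.
std::string detectBom(const std::string& bytes)
{
    auto startsWith = [&bytes](const std::string& prefix) {
        return bytes.compare(0, prefix.size(), prefix) == 0;
    };
    if (startsWith("\xEF\xBB\xBF")) return "UTF-8";
    if (startsWith("\xFF\xFE")) return "UTF-16LE";
    if (startsWith("\xFE\xFF")) return "UTF-16BE";
    return "none";
}

// Reads up to `count` raw bytes from the start of a file.
// Binary mode matters: text mode may alter bytes on Windows.
std::string readPrefix(const std::string& path, std::size_t count)
{
    std::ifstream in(path, std::ios::binary);
    std::string buffer(count, '\0');
    in.read(&buffer[0], static_cast<std::streamsize>(count));
    buffer.resize(static_cast<std::size_t>(in.gcount()));
    return buffer;
}
```

Calling `detectBom(readPrefix("product.txt", 4))` on the wmic output should report UTF-16LE if your machine behaves like mine did; a result of "none" only means no BOM was found, not that the file is ANSI.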

You were on the right path when trying to use wide strings, but dealing with Unicode can be tricky. The next few paragraphs will give you some tips on how to deal with this the hard way, but if you need a quick and easy solution, jump to the end.

There are two things to be careful of. First, make sure you're also using the wide streams, like wcout. It's worth casting each character to an int to double-check that there isn't a problem with the output formatting.
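Here is roughly what I mean by casting each character to an int. It's a small sketch, and `hexCodes` is just a helper name I picked; it prints the numeric value of every byte so that invisible characters can't hide behind the console's rendering.

```cpp
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

// Formats every byte of `text` as two hex digits, separated by spaces.
std::string hexCodes(const std::string& text)
{
    std::ostringstream out;
    out << std::hex << std::setfill('0');
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        if (i != 0)
            out << ' ';
        // Go through unsigned char first so bytes >= 0x80 don't sign-extend.
        out << std::setw(2)
            << static_cast<int>(static_cast<unsigned char>(text[i]));
    }
    return out.str();
}
```

If you pass one of your `cell` strings to this and see a pattern like `4e 00 6f 00`, the "spaces" are really zero bytes from the UTF-16 encoding, not 0x20 space characters.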

Second, the size of wchar_t (and therefore of the characters in wstring, wcout, etc.) is not fixed by the standard. On some compilers it's 2 bytes per character, and on others it's 4. You can usually change this in your compiler settings. C++11 also provides std::u16string and std::u32string, which are more explicit about their size.
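To see what your own compiler does, you can ask it directly. A tiny sketch follows; the constant names are mine, and the `static_assert` is only there to show how you might fail the build early if your code depends on a particular size.

```cpp
#include <climits>
#include <cstddef>
#include <string>

// Bytes per code unit for each string type on the current compiler.
// wchar_t is 2 bytes with MSVC and 4 bytes with GCC/Clang on Linux.
constexpr std::size_t kWideUnitBytes  = sizeof(std::wstring::value_type);
constexpr std::size_t kUtf16UnitBytes = sizeof(std::u16string::value_type);
constexpr std::size_t kUtf32UnitBytes = sizeof(std::u32string::value_type);

// Stop the build if wchar_t could not even hold one UTF-16 code unit.
static_assert(kWideUnitBytes * CHAR_BIT >= 16,
              "wchar_t is too small for UTF-16 code units");
```

Printing `kWideUnitBytes` tells you whether a wstring on your platform holds UTF-16 code units or full 32-bit code points.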

Reading Unicode text can unfortunately be quite a bit of a hassle with the C++ library, because even if you have the right string size, you need to deal with BOMs and endian formats, not to mention canonicalization.

There are libraries to help with this, but the simplest solution might just be to open the txt file in Notepad, choose Save As, then choose an encoding you're more comfortable with, like ANSI.

Edit: If you're not happy with the quick and dirty solution, and you don't want to use a better Unicode library, you can do this with the standard library, but only if you're using a compiler that supports C++11, such as Visual Studio 2012.

C++11 added some codecvt facets to handle converting between different Unicode file types. This should suit your purpose, but this part of the library was designed in the days of yore, and can be rather difficult to understand. Hold on to your pants.

Below the line where you open your ifstream, add this code:

infoFile.imbue(std::locale(infoFile.getloc(), new std::codecvt_utf16<char, 0x10FFFF, std::consume_header>));

I know that looks a bit scary. What it's doing is making a "locale" from a copy of the existing locale, then adding a "facet" to the locale which handles the format conversion.

"Locales" handle a whole bunch of stuff, mostly related to localization (such as how to punctuate currency, eg "100.00" vs "100,00"). Each of the rules in the locale is called a facet. In the C++ standard library, file encoding is treated as one of these facets.

(Background: In retrospect, it probably wasn't a very wise idea to mix file encoding up with localization, but at the time this part of the library was designed, file encoding was typically dictated by the language of the program, so that's how we got into this situation.)

So the locale constructor above is taking a copy of the default locale created by the file stream as its first parameter, and the second parameter is the new facet to use.

codecvt_utf16 is a facet for converting to and from utf-16. The first parameter is the "wide" type, which is to say, the type used by the program, rather than the type used in the byte stream. I specified char here, and that works with Visual Studio, but it's not actually valid according to the standard. I'll get to that later.

The second parameter is the maximum Unicode value you want to accept without throwing an error, and for the foreseeable future, 0x10FFFF represents the largest Unicode character.

The final parameter is a bitmask that changes the behaviour of the facet. I thought std::consume_header would be particularly useful for you, since wmic outputs a BOM (at least on my machine). This will consume that BOM, and choose whether to treat it as a little- or big-endian stream depending on what it gets.

You'll also notice that I'm creating the facet on the heap with new, but I'm not calling delete anywhere. This is not a very safe way to design a library in modern C++, but like I said, locales are a rather old part of the library.

Rest assured that you don't need to delete this facet. This isn't really documented very well (since locales are so rarely used in practice), but a facet constructed with the default reference count argument (0) will be automatically deleted by the locale it's attached to.

Now, remember how I said it's not valid to use char as the wide type? The standard says you have to use wchar_t, char16_t or char32_t, and if you want to support non-ASCII characters, you'll definitely want to do this. The easiest way to make this valid would be to use wchar_t, change ifstream, string, cout, and istringstream to wifstream, wstring, wcout, and wistringstream, then make sure your strings/char constants have an L in front of them, like so:

std::wcout << L"\nLine #" << lineNum << L":" << line << std::endl;

Those are all the changes you need in order to use wide strings. However, also beware that the Windows console cannot handle non-ANSI characters, so if you try to output such a character (when I ran the code I hit a ™ character), the wcout stream will be invalidated and stop outputting anything. If you're outputting to a file, this shouldn't be a problem.
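Putting the pieces above together, a wide-string reader might look something like the sketch below. Treat it as an illustration under a few assumptions rather than a drop-in fix: `readUtf16Lines` is a name I invented, I state `std::little_endian` explicitly because wmic writes UTF-16LE, I open the file in binary mode so the runtime doesn't translate line endings byte by byte, and I imbue before opening to stay on the safe side. Also be aware that the `<codecvt>` header is deprecated as of C++17, so newer compilers may warn about it.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>
#include <vector>

// Reads every line of a UTF-16LE text file into wide strings.
std::vector<std::wstring> readUtf16Lines(const std::string& path)
{
    // consume_header skips a BOM if present; little_endian sets the
    // byte order to use for the rest of the stream.
    using Utf16Facet = std::codecvt_utf16<
        wchar_t, 0x10FFFF,
        static_cast<std::codecvt_mode>(std::consume_header |
                                       std::little_endian)>;

    std::wifstream in;
    // The locale takes ownership of the facet, so there is no delete.
    in.imbue(std::locale(in.getloc(), new Utf16Facet));
    in.open(path, std::ios::binary);

    std::vector<std::wstring> lines;
    std::wstring line;
    while (std::getline(in, line))
    {
        // Binary mode leaves the CR of each CRLF pair in place.
        if (!line.empty() && line.back() == L'\r')
            line.pop_back();
        lines.push_back(line);
    }
    return lines;
}
```

Each returned line can then go through a `std::wistringstream` and `getline(lineStream, cell, L',')` exactly like in your original loop, with `Program` holding `std::wstring` members instead of `std::string`.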

You can probably tell that I'm not particularly thrilled about this part of the standard library. In practice, most people who want to use Unicode will use a different library (like the ones I mentioned in the comments), or roll their own encoders/decoders.
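For what it's worth, rolling your own decoder for this one case is not much code. The sketch below converts raw UTF-16LE bytes to a UTF-8 `std::string`, so the rest of your program can keep using narrow strings and streams. The function names are hypothetical, and the error handling is my own choice: unpaired surrogates become U+FFFD and a dangling odd byte at the end is ignored.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Appends one Unicode code point to `out` as UTF-8.
void appendUtf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Converts raw UTF-16LE bytes (optionally starting with FF FE) to UTF-8.
std::string utf16leToUtf8(const std::string& bytes)
{
    // Assemble one 16-bit code unit from two bytes, low byte first.
    auto unitAt = [&bytes](std::size_t pos) -> std::uint32_t {
        const std::uint32_t lo = static_cast<unsigned char>(bytes[pos]);
        const std::uint32_t hi = static_cast<unsigned char>(bytes[pos + 1]);
        return lo | (hi << 8);
    };

    std::size_t i = 0;
    if (bytes.size() >= 2 && unitAt(0) == 0xFEFF)
        i = 2; // skip the BOM

    std::string out;
    while (i + 1 < bytes.size())
    {
        std::uint32_t cp = unitAt(i);
        i += 2;
        // A high surrogate followed by a low surrogate forms one code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < bytes.size())
        {
            const std::uint32_t low = unitAt(i);
            if (low >= 0xDC00 && low <= 0xDFFF)
            {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                i += 2;
            }
        }
        if (cp >= 0xD800 && cp <= 0xDFFF)
            cp = 0xFFFD; // unpaired surrogate
        appendUtf8(out, cp);
    }
    return out;
}
```

You would read the whole file into a `std::string` with a binary-mode `ifstream`, pass it through `utf16leToUtf8`, and then feed the result to a regular `std::istringstream`. Splitting on `','` is still safe afterwards, because UTF-8 never uses the byte 0x2C inside a multi-byte character.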

Rick Yorgason
  • Thank you for the very informative answer! Is there a way to automate the file encoding conversion? I am trying to write a tool that should require little to no user interaction. – IWillByte Jun 03 '13 at 19:56
  • There are lots of ways to automate it. The canonical library is called ICU, although it's a little complicated. Qt's QTextStream is easier to use. If you're happy with just a hack and you only need to support versions of Windows that have PowerShell, you can use the [Set-Content](http://www.powershelladmin.com/wiki/Convert_from_most_encodings_to_utf8_with_powershell) cmdlet. – Rick Yorgason Jun 04 '13 at 02:10
  • O.K., I am trying to write the BOM (FF FE 0D 00 0A) as it wasn't being written before. I am writing a char array with the hex values but there is an extra 0x0D being added (FF FE 0D 00 0D 0A). Any ideas? Thanks again for the help by the way. – IWillByte Jun 17 '13 at 16:03
  • You might want to start another question for that with a code snippet, but I have a few thoughts: (1) Are you sure you want a BOM? If you know which format to expect, it can be easier to work with files that have no BOM. (2) The BOM is _not_ FF FE 0D 0A. The BOM is just FF FE. If you had FF FE 0D 00 0A __00__, then that would be the 16LE BOM followed by a carriage return and a newline. (3) If you're working with a `char` array, double check your casting. Remember that if you cast a `char` array to a `wchar_t` array (for example), you'll get neighboring chars smushed together. – Rick Yorgason Jun 19 '13 at 00:09
  • Thank you, I guess I don't need the BOM. I was trying to add it so I can open it in notepad, etc, but that isn't really needed. My issue now is how I read data from wmic (encoded with UTF-16LE) and save it to a c++ struct without the extra spaces. – IWillByte Jun 19 '13 at 17:15
  • Okay, I gave an explanation about how to read Unicode files using only the standard library. – Rick Yorgason Jun 20 '13 at 01:45
  • Rick, you are amazing. Your answer was better than any tutorial I have found on the subject. Not only was it the "right" way to do it, you also used the standard library (even though it can be a pain). I am no longer getting extra spaces but some lines are being skipped in the console, maybe because of non-ANSI characters like you said might happen. I am sure I can figure it out though. – IWillByte Jun 21 '13 at 18:18
  • Glad it helped. If you're looking for an authoritative source on the details of the standard library (short of reading the standard itself) I would suggest picking up The C++ Standard Library by Nicolai M. Josuttis. – Rick Yorgason Jun 25 '13 at 03:17

If your data doesn't have any spaces you need you can use my example:

std::string s = "test, delim, ";
std::string delimiter = ", ";

size_t pos = 0;
std::string token;

// Peel off one token at a time until no delimiter is left.
while ((pos = s.find(delimiter)) != std::string::npos)
{
    token = s.substr(0, pos);
    std::cout << token << std::endl;
    s.erase(0, pos + delimiter.length());
}
std::cout << s << std::endl; // last word

Alternatively, you can use strtok from the cstring library. You can also check my question, which is pretty much the same: strtok() analogue in C++

Community
vladfau
  • I need the spaces because I am saving program names and vendors which often have multiple words. – IWillByte May 31 '13 at 19:47
  • You can try the `erase` method for strings to remove the last character. – vladfau May 31 '13 at 19:50
  • What do you mean? If you are saying that it will remove the last character of a cell, that wont work because there is a space after every character, not just the last. – IWillByte May 31 '13 at 19:57
  • As far as I understood here: `getline(linestream, cell, ','))` you split your string by delimiter ','. Then, you can call `cell.erase(cell.end()-1,cell.end())`; – vladfau May 31 '13 at 20:06
  • Wouldn't that just remove the last character? The problem is that each character in 'cell' has a space after it. – IWillByte May 31 '13 at 20:19
  • Do you mean that `string` comes as `s t r i n g`? – vladfau May 31 '13 at 20:20
  • Yes, there is a space after every character. I am also writing the data to a txt file and noticed that there is an extra blank line after the version and before the install date. – IWillByte May 31 '13 at 20:33
  • Then, I'd rather recommend you use `strtok` from `cstring`. [Example](http://www.cplusplus.com/reference/cstring/strtok/). I can't find the reason why stringstream works that way. Just use `cell.c_str()` to convert `string` to `(const) char*`. – vladfau May 31 '13 at 20:42

If the data has an extra space after every character, I suppose it means it also has an extra space after a regular space.

Thus you can safely erase every space (every char, actually) which does not have another space right before it. This assumes you do not have two white spaces in a row in the original data, but if you do, you just need one extra flag to take care of that.

So your code could become something like this:

while(getline(infoFile, line))
{
    int lsize = line.size(), at = lsize > 0 ? 1 : 0; // keep the first char, if there is one
    for(int i = 1; i < lsize; ++i)
        if(line[i-1] == ' ') line[at++] = line[i];
        // if there is no space behind it, skip it, it is a broken space itself!
    line.resize(at);

    lineNum++;
    // std::cout << "\nLine #"...

I realize this is not completely ideal as you're not actually stopping the core problem from happening, but considering you've been trying for days, this at least effectively mitigates the problem by fixing it after it has happened.

Check the live demo.

i Code 4 Food

In my case, I solved the problem by changing the encoding to utf8 using Notepad++.

  1. Open the Encoding menu.

  2. Click on utf8 to make the change, then save.

ibra