I'm reading a CSV file into a 2D vector. However, when the reader reaches the cell below, it treats it as the end of the line, so the vector for that row ends up with a size of only 1.

"HAM/  G60-II (1111-66F)
SION-01",

This is what it looks like when I open the CSV file in Notepad. When I pasted it here, a newline character appeared in the middle of the value, but it doesn't look like that in Notepad. Here's a snip of what it looks like in Notepad.

It's also weird because when I look at that cell in Excel, the "SION-01" is nowhere to be found, no matter how far I expand the column. However, when I Ctrl+F the document for it, the search points right to that cell. It looks like there's a newline character in the actual value, which is causing the problem.

I have one idea, but I don't know how to implement it: read the values between the commas. The catch is that I then wouldn't know where a line actually ends. I really don't know how to proceed. Most of the files that I read with my program have values enclosed in quotes, but some don't.

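// list is declared elsewhere (presumably a vector<vector<string>>)
ifstream file(filename);
while (file)
{
    string line;
    if (!getline(file, line)) break;    // stop once no more lines can be read

    istringstream ss(line);
    vector<string> words;
    while (ss)
    {
        string s;
        if (!getline(ss, s, ',')) break; // split the line on commas
        words.push_back(s);
    }

    list.push_back(words);
}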
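
As the comments point out, the read loop is more commonly written as `while (getline(...))`, which tests whether the read itself succeeded. A minimal sketch of that form follows; the function name `read_rows_naive` is hypothetical and not part of the original post. It keeps the same line-then-comma splitting as the code above, so it still breaks a quoted field that contains a newline.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: same splitting logic as the question, with the
// loops driven directly by the result of getline.
std::vector<std::vector<std::string>> read_rows_naive(std::istream& file)
{
    std::vector<std::vector<std::string>> rows;
    std::string line;
    while (std::getline(file, line)) {       // loop ends when a read fails
        std::istringstream ss(line);
        std::vector<std::string> words;
        std::string s;
        while (std::getline(ss, s, ','))     // split the physical line on commas
            words.push_back(s);
        rows.push_back(words);
    }
    return rows;
}
```

Feeding this a record whose quoted field spans two physical lines yields two rows, the first with a single element, which matches the behaviour described in the question.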
  • Your loop `while (file)` is in a way the same as `while (!file.eof())`. And `while (!file.eof())` is [considered wrong](http://stackoverflow.com/questions/5605125/why-is-iostreameof-inside-a-loop-condition-considered-wrong). – Some programmer dude Jun 28 '18 at 13:23
  • As for your problem, this might be a good time to [learn how to debug your programs](https://ericlippert.com/2014/03/05/how-to-debug-small-programs/). – Some programmer dude Jun 28 '18 at 13:24
  • What would you suggest instead of `while (file)`? And I spent a while debugging it, and I know the cause is the newline character within the actual value, but I don't know how to take that into account. – Colebacha2 Jun 28 '18 at 13:26
  • CSV with quoted fields and embedded commas and newlines isn't simple to parse. `while(getline(...)) {}` is generally a better way to compose the loop, but the way you're doing it is fine as well. The important thing is to test if your input succeeded and not the eof flag so you don't process the failed input. – Retired Ninja Jun 28 '18 at 13:26
  • You have a special character in that cell that a) prevents "SION-01" from being displayed but is nevertheless written into the CSV file, and b) causes Notepad and getline() to assume "end of line". Edit the cell and delete everything after the relevant content (even if nothing more is displayed) and then try again. – Rene Jun 28 '18 at 13:26
  • Is the file Unicode or ASCII? If it is ASCII, how was it created? Different operating systems can use different characters for things like newlines. Also, in general, don't use Notepad for things like this; it is not very reliable. – Qubit Jun 28 '18 at 13:27
  • @Rene That is definitely a simple fix; however, there are potentially many items like this and I'll be using my program to analyze CSV files from multiple sources. It's not realistic for me to go through manually and fix these. – Colebacha2 Jun 28 '18 at 13:28
  • Among other problems, the way you parse the file is wrong: data may contain commas, in which case `getline` won't work as expected. [Full CSV specifications](https://tools.ietf.org/html/rfc4180) – zdf Jun 28 '18 at 13:29
  • Actually, when dealing with CSV files I really suggest you try to find an existing library which can read and parse them for you properly. CSV files might seem to be simple on the surface, but there are so many special and corner cases that it easily becomes very hard and complex to do. Using a library which can already handle all those special and corner cases will make your life much easier. – Some programmer dude Jun 28 '18 at 13:29
  • The answer to what you should do instead of `while (file)` is in the link provided. – Some programmer dude Jun 28 '18 at 13:31
  • @Qubit I believe it is ASCII but I really don't know. It was generated on a server but I have no idea what OS it is using. – Colebacha2 Jun 28 '18 at 13:31
  • If you open up the cell that starts "HAM/ G60-II", you may find it has a second line of text in that cell. If you can edit the newline away, then that is an easy fix. If you need to preserve that newline, you will need to parse the text. If it has an opening quote, then you need to accept every character after the quote as part of the field until the next quote, except where that quote is preceded by an odd number of backslashes. – Gem Taylor Jun 28 '18 at 13:33
  • @rustyx It already reads until the EOL in the istringstream and then I parse that inside the while loop with getline. I wish I could read it until EOF but then I wouldn't know when the newlines were since there appear to be newline characters inside values. – Colebacha2 Jun 28 '18 at 13:34
  • @GemTaylor I think my only choice is to use a CSV parsing library as SomeProgrammerDude suggested or to do this. It isn't feasible for me to manually delete the newlines from every value that has them (there's over 40k entries in this file). Is there a CSV parser that handles special cases like this? – Colebacha2 Jun 28 '18 at 13:37
  • A parsing library is the best answer if you have no control over the file contents. OR you could use search-and-replace inside the spreadsheet program to replace all newlines, commas, and quotes with metachars. OR if the data is only 1 column (not really comma-separated), then you can do some easy cheats that will work 99% of the time, but you are paying for a library that solves that other 1%. – Gem Taylor Jun 28 '18 at 13:49
  • You can write a fairly simple state machine to parse the data that is aware of how quoted fields work and escaped characters work, but as @Someprogrammerdude said, there are some tricky cases you may have to deal with in the CSV spec so an existing library might be a better choice. If you choose to try and do it you won't be able to use getline, you'll generally need to parse character by character and maintain state in the parser to know if the next comma or quote terminates the field. – Retired Ninja Jun 28 '18 at 13:50
  • @GemTaylor Which parsing library would you recommend? Search-and-replace seems doable, but I don't want it replacing the newline characters that I care about, the ones that actually mean a new row in the document. And I would prefer to be able to do it programmatically. – Colebacha2 Jun 28 '18 at 13:53
  • I'm suggesting that if you use the search and replace function in your spreadsheet application, then it will preserve all the cell locations, but leave the content easier to parse. If you are going to do it regularly, you could write a macro to do the cleanup and then the export. This all depends on what tools you are using to generate the csv, of course. – Gem Taylor Jun 28 '18 at 16:54

0 Answers