I have a CSV file that needs to be read into a matrix. Right now I have this regex pattern:

regex pat { R"(("[^"]+")|([^,]+))" }

I found similar topics on Stack Overflow, but they either used a different regex pattern or a language other than C++. Right now the pattern chooses between sequences enclosed in quotes and anything that is not a comma. The file contains survey data with yes/no questions. If you answer "no", you do not need to answer some related questions. Therefore I get sequences in the file like this: ":,,,,,,,," where each pair of adjacent commas marks an empty field. But I would like to keep every row as an array with the same number of fields, since that seems to make it easier to navigate the matrix later. So I need to extract the empty fields between the commas as well. I could not find a regex pattern for an empty sequence. Is a regex pattern the proper way to solve this?

fredric
  • Possible duplicate of [Regex to split a CSV](http://stackoverflow.com/questions/18144431/regex-to-split-a-csv) – cxw Jan 29 '16 at 13:34
  • Just replace pluses with stars, as in `[^,]*`. Note however that `"I contain ""quotes"""` is a valid CSV field containing properly escaped double quotes. Your regex will choke on it. – Igor Tandetnik Jan 29 '16 at 16:00
  • If I use [^,]* it only reads the first field of each row. Is it a regex problem or something else in the code? – fredric Jan 29 '16 at 17:46
  • Next time, please consider providing sample input and expected output. – Rumburak Jan 30 '16 at 08:22

1 Answer

This code illustrates sample usage of the named pattern:

#include <iostream>
#include <iterator>
#include <string>
#include <regex>

int main()
{
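  // Group 2: contents of a quoted field; group 3: an unquoted (possibly empty)
  // field; group 4: the separator that follows, either a comma or end of input.
  std::regex field_regex("(\"([^\"]*)\"|([^,]*))(,|$)");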

  for (const std::string s : {
      "a,,hello,,o",
      "\"a\",,\"hello\",,\"o\"",
      ",,,,"})
  {
    std::cout << "parsing: " << s << "\n";
    std::cout << "======================================" << "\n";
    auto i = 0;
    for (auto it = std::sregex_iterator(s.begin(), s.end(), field_regex);
        it != std::sregex_iterator();
        ++it, ++i)
    {
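      auto match = *it;
      // A quoted field lands in group 2, an unquoted one in group 3.
      auto extracted = match[2].length() ? match[2].str() : match[3].str();
      std::cout << "column[" << i << "]: " << extracted << "\n";
      // An empty group 4 means "$" matched, so this was the last field.
      if (match[4].length() == 0)
      {
        break;
      }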
    }
    std::cout << "\n";
  }
}

Output:

parsing: a,,hello,,o
======================================
column[0]: a
column[1]: 
column[2]: hello
column[3]: 
column[4]: o

parsing: "a",,"hello",,"o"
======================================
column[0]: a
column[1]: 
column[2]: hello
column[3]: 
column[4]: o

parsing: ,,,,
======================================
column[0]: 
column[1]: 
column[2]: 
column[3]: 
column[4]: 
Rumburak
  • This code worked, but when I tried to add the quotes exception, which I also need, using regex l{ R"(("[^\n]*")|([^,]*),)" }; it did not work. It recognizes quoted fields but misses the other cases. Do you spot the difference? – fredric Jan 29 '16 at 22:00
  • About the last comma: if the pattern does not read anything after the last comma, it does not extract the last field of a row, which means it is not working. – fredric Jan 29 '16 at 22:04
  • @fredric thought you wanted some fun, too :-) But OK, see edited answer – Rumburak Jan 30 '16 at 08:21
  • I had lots of fun, unfortunately without success. Thank you for the help; it seems to work, with the extra feature of adding one empty field at the end of each row. – fredric Jan 30 '16 at 09:58
  • @fredric Huh? Have you looked at the output of the edited version? It finds the content after the last comma, too. – Rumburak Jan 30 '16 at 10:00
  • Yes, it works fine for that purpose, but if there is no last comma it outputs an empty field after the last field. Also, I had to change the (\"([^\"]*)\" part to (\"([^\n]*)\", or it did not extract quotes inside quotes, for example "aa,bb,"cc"" – fredric Jan 30 '16 at 10:08
  • @fredric OK, but you seem to know what you need to do now. – Rumburak Jan 30 '16 at 10:16
  • Yes, the input now works well enough that I can continue working on the data. Thank you for the help. – fredric Jan 30 '16 at 15:26
  • I noticed a problem. The regex pattern is not actually capable of recognizing embedded quotes. As I do not have those in my data file, that is OK. But there are a couple of questions people can answer in their own words, and if they use quotes it may cause errors in the input. Can this problem be handled with a pattern? Also, spending my time on correct CSV input makes me consider a parser. Which are the most reliable for C++? – fredric Jan 31 '16 at 23:34
  • @fredric If you had given representative sample input in your question, including quotes in quotes, I would have told you from the beginning that you're better off using something other than regex. Since I think I answered the original question, I suggest you accept this answer and write a new question for choosing the appropriate parser tool. – Rumburak Feb 01 '16 at 05:28
  • Yes, I asked a new question here: http://stackoverflow.com/questions/35132819/looking-for-csv-parser-tool-for-cpp – fredric Feb 01 '16 at 14:20