
I regularly deal with files that look like this (for compatibility with R):

# comments
# more comments
col1 col2 col3
1 a hi
2 b there
. . .

Very often, I will want to read col2 into a vector or other container. It's not hard to write a function that parses this kind of file, but I would be surprised if there were no well-tested library to do it for me. Does such a library exist? (As I say, it's not hard to roll your own, but as I am not a C++ expert, it would be some trouble for me to use the templates that would allow me to use an arbitrary container to hold arbitrary data types.)

EDIT: I know the name of the column I want, but not what order the columns in this particular file will be in. Columns are separated by an unknown amount of white space, which may be tabs or spaces (probably not both). The first entry on each line may or may not be preceded by white space, and sometimes that changes within one file, e.g.

number letter
 8 g
 9 h
10 i
  • Save your file as CSV and use CSV parser? – garbagecollector Apr 13 '12 at 16:42
  • How big are the files? While this isn't particularly difficult, it's rare to find a solution that isn't absurdly slow. – Ben Voigt Apr 13 '12 at 16:43
  • Most often 100-1000 lines. The largest of them are ~10 million lines. I'm not so much concerned with performance as development cycle. – flies Apr 13 '12 at 16:52
  • I think I should probably just close this question. I guess the thing I want is probably too particular to my situation to have a standard solution, though this surprises me, as R reads and writes files like this, and surely there are people who are using both C++ and R. – flies Apr 13 '12 at 18:33

2 Answers


I am not aware of any C++ library that will do this. A simple solution, however, would be to use the Linux `cut` utility. You would have to remove the comments first, which is easily done with `sed`:

sed -e '/^#/d' <your_file>

Then you could apply the following command which would select just the text from the third column:

cut -d' ' -f3 <your_file>

You could combine those together with a pipe to make it a single command:

sed -e '/^#/d' <your_file> | cut -d' ' -f3

You could run this command programmatically, then simply append each line of its output to an STL container:

// read each line of the command's output into a container
std::string line;
std::vector<std::string> container;
while (std::getline(file, line))
{
  container.push_back(line);
}

For how to actually run cut from within code, see this answer.
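As a minimal sketch of that idea (assuming a POSIX system, where `popen` is available; the filename and command string are placeholders, not something from the question):

```cpp
#include <array>
#include <cstdio>
#include <string>
#include <vector>

// Run a shell command with popen() and collect each line of its
// standard output into a std::vector<std::string>.
std::vector<std::string> run_pipeline(const std::string& cmd) {
    std::vector<std::string> container;
    FILE* pipe = popen(cmd.c_str(), "r");
    if (!pipe) return container;  // command could not be started

    std::array<char, 4096> buf;
    while (std::fgets(buf.data(), buf.size(), pipe)) {
        std::string line = buf.data();
        if (!line.empty() && line.back() == '\n')
            line.pop_back();  // strip the trailing newline
        container.push_back(line);
    }
    pclose(pipe);
    return container;
}
```

Used with the pipeline above it might look like `auto col3 = run_pipeline("sed -e '/^#/d' your_file | cut -d' ' -f3");`, where `your_file` stands in for the real path.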

  • it seems like you'd have to parse the file first to remove comments and the header declaring column names, then pipe the result to cut. – flies Apr 13 '12 at 17:05
  • Is there a way to get the `cut` delimiter to be a variable length of white space composed of tabs and/or spaces? will it handle lines that begin with whitespace differently? `perl -e 'while (<>) { next if /^#/; chomp; print((split)[1], "\n"); }'` will give me the 2nd column in a file excluding comments, but I fail to see the advantage of any of this over reading and splitting within C++. – flies Apr 13 '12 at 18:16
  • I think you are right, if `cut` isn't going to work because single-character delimiters aren't enough for the files you will encounter, it is likely easier to just do all this in code with splitting in C++. – Cory Klein Apr 13 '12 at 18:40

Boost split may do what you want, provided you can consistently split on whitespace.

  • The columns will be delimited by whitespace (variable length, spaces and/or tabs). Splitting is not too hard - http://stackoverflow.com/questions/236129/how-to-split-a-string-in-c – flies Apr 13 '12 at 17:16
  • This is also a viable option. Loop through each line, and `split` on the whitespace, then put the resulting list into a 2d array. Then you could run down the 2d array selecting the item you want from the proper column. – Cory Klein Apr 13 '12 at 17:40