I'm a beginner in C++, so I hope you'll bear with me.

I'm trying to read a file in which each line of text either looks like this (the first few lines, called header lines):

@HD VN:1.5  SO:queryname

or like this

read.1  4   *   0   0   *   *   0   0   CAACCNNTACCACAGCCCGANGCATTAACAACTTAANNNCNNNTNNANNNNNNNNNNNNTTGAAAAAAAAAAAAAAAAAA    A<.AA##F..<F)<)FF))<#A<7<F.)FA.FAA.)###.###F##)############)FF)A<..A..7A....<F.A    XC:Z:CAACCNNTACCA   RG:Z:A  XQ:i:2

Both are tab delimited.

The file is very large and therefore is in binary format. I'm wondering whether it is possible to read from the binary format file each line, do some processing on that line, and then write it to a binary format output file.

I started with this code:

#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int main(int argc, char* argv[])
{
  string input_file = argv[1];
  string output_file = argv[2];
  string line;
  ifstream istream;
  istream.open(input_file.c_str(),ios::binary|ios::in);
  ofstream ostream;
  ostream.open(output_file.c_str(),ios::binary|ios::out);
  while(getline(istream,line,'\n')){
    if(line.empty()) continue;
    //process line assuming it is read as a string
    ostream<<line<<endl;
  }
  istream.close();
  ostream.close();
}

But it crashes with: Segmentation fault (core dumped), in the part where I'm trying to parse line to a string vector.

Is there a way to read the binary format and split it by lines, do string processing on each such line, and then write them to a binary output?

BTW, I'm running this on Linux.

dan
    Every file is, in principle, binary, because that's just how computers work. Now, saying "I'm trying to read it line-by-line" clearly means you're treating it as a text file. Now, if it's a text file, this is probably not a problem, but if it's a large file without any newline character, you're probably just using up all your RAM. – Marcus Müller May 07 '16 at 20:20
  • by the way, `argv[0]` is usually your program's executable name, **not** the first argument specified when running it. – Marcus Müller May 07 '16 at 20:21
  • "it crashes" is also not a proper problem description. What does it do? a Segfault? What does your debugger say? – Marcus Müller May 07 '16 at 20:22
    Nitpicking, but: This won't compile. Please make 100% sure the code you post is **identical** to what you can try yourself. – Marcus Müller May 07 '16 at 20:26
  • Really, we can't help you if your example doesn't compile. This one is missing a namespace declaration; we can fix this in our local copies (when trying out your code), but it's really no use fixing something that is different from what runs on your machine. – Marcus Müller May 07 '16 at 20:29
  • How large are the files you are trying to read in? Have you tried this with small files? Also you mention that you are attempting to parse `line` - there could be an issue with this code, so it's best if you provide that here too. – sjrowlinson May 07 '16 at 20:31
  • You need to check `argc` to see if it's at least 3. By the way, if this isn't actually a text file, there's no definition of a 'line' to read from the file this way. – Spencer May 07 '16 at 20:33
  • The size of the input file (you quote 30 GB) is most likely the major issue here - why do you have such large files in the first place?! – sjrowlinson May 07 '16 at 20:35
  • I see. So should I read it by buffers and parse each buffer by the new line character? – dan May 07 '16 at 20:35
  • It's next generation sequencing data and that's pretty standard size for these types of data – dan May 07 '16 at 20:37
  • @dan Binary files do not have "lines", and that `\n` will not work if the file is opened in binary mode. That `\n` is a text-mode, synthesized character sequence that is equal to a combination of a carriage return-line feed, or just a line-feed, depending on whether the text file is Windows or Linux text file. So in short, your code won't work if this is how you're going to process it. You have to explicitly test for the CR-LF character sequence or the LF character if the file is opened in binary mode, and that requires you to know or detect what end-of-line sequence is used in the file. – PaulMcKenzie May 07 '16 at 20:38
    @PaulMcKenzie: Sorry, that's not right. The LF character is \n. If you are on Windows, the variable `line` will have a trailing `\r` character, but that is all. (There are old Macs where lines are separated by `\r` and not `\n`, but it is unlikely the OP is processing 30GB files on one of those. There are also filing systems where there really *is* no end of line character - but I'll bet he isn't using one of those either.) – Martin Bonner supports Monica May 07 '16 at 20:45
  • @MartinBonner -- ok. But my main point is that if the file is opened in binary mode as opposed to text mode, you have no runtime library help in processing a file line-by-line. The programmer has to figure out what "end-of-line" is supposed to mean. – PaulMcKenzie May 07 '16 at 20:51
    @PaulMcKenzie nah. I feel bad for contradicting you, but the file mode doesn't matter in this case. `getline` is explicitly told what the delimiter is. – Marcus Müller May 07 '16 at 20:52
  • @dan - Q1: Does the code *as posted* crash? Or only when you do some "parse line to a string vector" (which you haven't shown). Q2 : How many line in your 30GB file? How long (to an order of magnitude) is the longest line? – Roddy May 07 '16 at 21:12
  • Hi Roddy. A1. The code crashes when I'm trying to parse line by '\t'. A2. The file contains ~200 million lines. The first 2 lines have the first format shown above and the rest have the second format shown above, so the length is fixed for all the remaining lines. – dan May 07 '16 at 21:17
  • @MarcusMüller I see that. But for me, when I do "binary processing", I don't think in text mode. It's just a run of characters with no special meaning for any of those characters. That's why I would expect to `read()` the characters, as opposed to `getline()`. – PaulMcKenzie May 07 '16 at 21:18
  • @dan So if the file were not 200 million lines, but 3 or 4 lines with the format you described, would your application still crash? – PaulMcKenzie May 07 '16 at 21:19
  • Yes, because my problem is that line is not a string. – dan May 07 '16 at 21:33
  • @dan why didn't you include the parsing code if that is where the crash happens? – marcinj May 07 '16 at 21:59
  • That code assumed that line is a string, which is not the case, so I'm not sure what the point would have been in doing that. – dan May 07 '16 at 22:03
  • What do you mean, "line is not a string"? It is of type std::string; what other properties does it need to have to be "a string", according to your definition, that it doesn't have? – Martin Bonner supports Monica May 08 '16 at 07:00

3 Answers


Is it possible to read a binary file line by line?

Every file is, in principle, binary, because that's just how computers work. Now, saying "I'm trying to read it line-by-line" clearly means you're treating it as a text file – "line" is a text concept.

The file is very large and therefore is in binary format.

That's top-notch bullshit. Size doesn't change the format of your file.

How do I get each line as a string? Does the ostream<<line<<endl; work for writing a string to a binary file?

Yes and no: if your file is not a text file, why is it important where these '\n' characters are? To a non-text file, these are just normal bytes like 'a' or 0x00 or 0xFF. So basically, you're looking at ingrain wallpaper and trying to spot letters in there.

However, with your illustration of the files we're talking about, they are in fact files that only contain text.

So your problem seems to lie in the fact that a single line might exceed what storage you have available in std::string. That's a rare case – but it can happen for genetic strings, it seems. Well.

Get yourself familiar with the non-text-oriented file I/O that C++ has. Basically, there's ifstream.read(), and you should use it to get a (limited) amount of bytes, do your processing, write to output, repeat. Look out for the newline character in your input, and "rewind" your file (seekg) if you've read past it.

Also, I really wonder how long your lines have to become to break std::string. I guess you might be running on some very limited OS (32 bit?) or computer (very little RAM + Swap?).

Marcus Müller

If your file is structured into lines, and each line is terminated with a \n then it is a text file. Every file is binary underneath, and text files are just a special kind of binary file.

So, given that, the code you've shown is likely to work fine for files of any size.

You should really remove the ios::binary, but I don't expect it to make any difference in this case.

But if you're getting a crash while "processing" a line of the file, that's where the bug is most likely to be – in the code you haven't disclosed, yet!
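For what it's worth, here's a minimal tab-splitting helper of the kind your undisclosed parsing code presumably contains – an assumption on my part, and split_tabs is just a placeholder name:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split one tab-delimited line into its fields. std::getline with a
// '\t' delimiter consumes one field per call until the stream is
// exhausted, so memory use is bounded by the line length.
std::vector<std::string> split_tabs(const std::string& line)
{
    std::vector<std::string> fields;
    std::stringstream ss(line);
    std::string field;
    while (std::getline(ss, field, '\t'))
        fields.push_back(field);
    return fields;
}
```

If something like this still segfaults, the usual suspect is indexing the result (e.g. fields[10]) without first checking fields.size() – header lines like "@HD ..." have far fewer fields than the read lines.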

Roddy

It looks like your file has different line endings than you expect. It could have \r while you expect \n. If that is the case, then std::getline tries to read the whole 30 GB file into the line std::string.

I suggest you check what line endings you have in your file, to verify the above. If that is the case, then you can use the line-reading function from this SO answer: Getting std::ifstream to handle LF, CR, and CRLF?, which should read lines even if their endings are not compatible with your platform (or rather, endings which you do not expect).

Also, you should be fine using non-binary file mode. The sample lines you have shown in the question do not look very binary to me.

marcinj
  • Ok guys. Thanks a lot for your help (and again for the lovely illustration - Marcus Müller). I'm going to delete this post since it's leading nowhere. – dan May 07 '16 at 21:41