
I have a file containing a header and a very long string like:

>Ecoli100k
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTG
GTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAATATAGGCATAGCGCACAGAC
....

I tried to retrieve the file size and header size using:

ifstream file(fileName.c_str(), ifstream::in | ifstream::binary);

string line1;
getline(file,line1);
int line1Size = line1.size();

file.seekg(0, ios::end);
long long fileSize = file.tellg();
file.close();

For example, for a file containing a string of length 100k with the header >Ecoli100k, fileSize is 101261 and line1Size is 10. Now, to calculate the length of the string without reading any further:

101261 - (10+1) = 101250, which means that without the header line the file contains 101250 more characters.

101250/81 = 1250, which means there are 1250 full lines of 80 characters plus a \n each (but the last line has no \n), so we must subtract 1249 from 101250 to get the length of the string. But the result is wrong: we get 100k+1 instead of 100k.

In code:

int remainedLineCount = 
        (fileSize - line1Size - 1 - 1 /*the last line has no \n*/)/81 ;
cout<<(fileSize - line1Size - 1 - remainedLineCount )<<"\n";

In another example I added just one more character. Because of a newline in the file, the size changes to 101263, and this calculation again gives 100k+2 instead of 100k+1.

Does anyone know where this extra 1 comes from? Is there anything at the end of a file?

Edit:

As requested, here are the binary values (in hexadecimal) of the bytes at the beginning and end of the file:

offset 0: 3e 45 63 6f 6c 69 31 30 30 6b

offset 0000018b83: 54 47 47 43 41 47 41 41 43 0a

Thanks All.

ameerosein
  • What character is used to mark the end of the string? – Thomas Matthews Oct 06 '16 at 18:22
  • You would be better off if you started by looking at the file you're processing in a good hex editor (not a plain text editor), one that shows the byte offset of each piece of data you're looking for. Then, given the information you gathered from looking into the file, tailor your program to fit. The answer given by Christophe shows the different scenarios, but you could have easily discovered this if you had inspected the file first. – PaulMcKenzie Oct 06 '16 at 18:37
  • @ThomasMatthews Nothing, I guess... I write the file myself from RStudio and add nothing at the end, but the string is written from a DNAString or DNAStringSet from the Bioconductor packages in R. Could you explain more about how to check? And please check out the Edit in my post... – ameerosein Oct 07 '16 at 05:57

1 Answer


There are several candidates:

  • If you're on Windows and the file was written in text mode, then the first line plus its newline is stored in 10+2 chars, as '\n' is translated into '\r'+'\n';
  • again, if the file was written in text mode, it is possible that an end-of-file character was added (not visible in text mode) that becomes readable in binary mode;
  • it is also implementation-dependent whether or not a '\n' is added to the last line of the file (see the explanations under my second edit).
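To tell these three cases apart, one could look at the raw bytes. The following is only a minimal sketch: `EndingInfo`, `describeEndings` and `readAllBytes` are hypothetical names introduced here for illustration, and the whole file is read into memory, which is fine for a file of about 100 KB but may not suit much larger ones.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper type, for illustration only.
struct EndingInfo {
    bool crlfFirstLine;    // first line ends with "\r\n" instead of "\n"
    bool trailingNewline;  // last byte of the data is '\n'
    bool trailingCtrlZ;    // last byte is 0x1A (DOS-style end-of-file marker)
};

// Inspects raw bytes as obtained from a binary-mode read.
EndingInfo describeEndings(const std::string& bytes) {
    EndingInfo info{false, false, false};
    const std::string::size_type nl = bytes.find('\n');
    if (nl != std::string::npos && nl > 0)
        info.crlfFirstLine = (bytes[nl - 1] == '\r');
    if (!bytes.empty()) {
        info.trailingNewline = (bytes.back() == '\n');
        info.trailingCtrlZ = (bytes.back() == '\x1A');
    }
    return info;
}

// Reads the whole file in binary mode, so no newline translation happens.
std::string readAllBytes(const std::string& fileName) {
    std::ifstream in(fileName.c_str(), std::ios::in | std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Calling `describeEndings(readAllBytes(fileName))` then reports which of the candidates applies to a given file.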

Edit:

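In case of doubt about the encoding, you could display the binary values (in hexadecimal) of the bytes at the beginning and end of your file:

#include <iomanip>
#include <iostream>
using namespace std;

// Utility function: print the current offset, then the next `count` bytes in hex.
// Note: hex and setfill('0') stay set on cout, so later offsets print zero-padded in hex.
void show (istream &ifs, int count) {
    cout <<"offset "<<setw(10)<<ifs.tellg()<<": ";
    for (int i=0; i<count; i++)
        cout << setw(2) << setfill('0') <<hex<<ifs.get()<<" ";
    cout <<endl;
}

// with your newly opened filestream (opened in binary mode):
show(ifs, 12);            // the 10-char header and whatever ends that line
ifs.seekg(-10,ios::end);  // move to 10 bytes before the end of the file
show(ifs, 10);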

Edit 2:

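So it appears that you have a newline at the end of your last line (the final byte in your output is 0a, the ASCII code of '\n'). Your calculation assumes that the last line has no newline, so it subtracts one newline too few, and that is your extra 1.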
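With the trailing newline taken into account, the arithmetic from the question can be sketched as follows. This is an illustration under stated assumptions, not a general FASTA parser: `sequenceLength` is a hypothetical name, every sequence line except possibly the last is assumed to hold exactly `lineWidth` characters (80 in the question), and each newline is assumed to occupy `newlineBytes` bytes (1 for '\n', 2 for '\r'+'\n').

```cpp
#include <cassert>

// Hypothetical helper: computes the sequence length from sizes alone.
// fileSize   : total size in bytes, as returned by tellg() at the end
// headerSize : length of the header line without its newline
// lineWidth  : characters per full sequence line (assumed constant)
long long sequenceLength(long long fileSize, long long headerSize,
                         long long lineWidth, bool trailingNewline,
                         long long newlineBytes = 1) {
    assert(lineWidth > 0 && newlineBytes > 0);
    assert(fileSize >= headerSize + newlineBytes);

    // Bytes that follow the header line and its newline.
    const long long body = fileSize - headerSize - newlineBytes;
    const long long stride = lineWidth + newlineBytes;

    // Count the newlines inside the body.
    long long newlines = body / stride;
    if (trailingNewline && body % stride != 0)
        ++newlines;  // a shorter last line that still ends with a newline

    return body - newlines * newlineBytes;
}
```

For the two files in the question, `sequenceLength(101261, 10, 80, true)` gives 100000 and `sequenceLength(101263, 10, 80, true)` gives 100001. Whether `trailingNewline` should be true can be decided by reading the last byte of the file in binary mode.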

It's important to understand that text mode and binary mode may differ. The C++ standard doesn't detail the differences but, in its section 27.9.1.4, relies on the C stdio library, which is described in the C11 standard:

7.21.2/2: A text stream is an ordered sequence of characters composed into lines, each line consisting of zero or more characters plus a terminating new-line character. Whether the last line requires a terminating new-line character is implementation-defined. Characters may have to be added, altered, or deleted on input and output to conform to differing conventions for representing text in the host environment. Thus, there need not be a one-to-one correspondence between the characters in a stream and those in the external representation. Data read in from a text stream will necessarily compare equal to the data that were earlier written out to that stream only if: the data consist only of printing characters and the control characters horizontal tab and new-line; no new-line character is immediately preceded by space characters; and the last character is a new-line character. Whether space characters that are written out immediately before a new-line character appear when read in is implementation-defined.
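The missing one-to-one correspondence can be observed directly. The sketch below writes a string through a text-mode stream and then measures the file through a binary-mode stream; `sizeOnDisk` and the file name in the usage note are hypothetical, chosen only for this illustration.

```cpp
#include <fstream>
#include <string>

// Writes `text` through a text-mode stream, then returns the on-disk size
// in bytes as seen through a binary-mode stream (-1 if the file can't be read).
long long sizeOnDisk(const std::string& fileName, const std::string& text) {
    {
        std::ofstream out(fileName.c_str());  // text mode: no ios::binary
        out << text;
    }  // the stream is closed here, so the data has been flushed

    std::ifstream in(fileName.c_str(), std::ios::in | std::ios::binary);
    in.seekg(0, std::ios::end);
    return static_cast<long long>(in.tellg());
}
```

A call such as `sizeOnDisk("textmode_demo.tmp", "AB\nCD\n")` passes 6 characters to the stream. On Windows the result is typically 8, because each '\n' is stored as '\r'+'\n', while on Linux or macOS it is typically 6.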

Christophe