I'm writing code that reads huge text files containing DNA bases and I need to be able to extract specific parts. The file looks like this:
TGTTCCAGGCTGTCAGATGCTAACCTGGGG
TCACTGGGGGTGTGCGTGCTGCTCCAGCCT
GTTCCAGGATATCAGATGCTCACCTGGGGG
...
Every line is 30 characters.
A separate file lists the parts to extract, each as a start value and an end value. For each start/end pair, I need to extract the corresponding string from the file and store it in a separate temporary file. Positions count bases only, not line breaks. For example, with start=10, end=45, I need the string that starts at the 10th character of the first line (C) and ends at the 15th character of the 2nd line (C).
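To make the indexing concrete, here is a minimal sketch of how a base position could be mapped to a byte offset in the file. It assumes every line holds exactly 30 bases followed by a single '\n' byte (no "\r\n" endings, no header line); `LINE_LEN` and `base_to_offset` are names made up for illustration.

```c
#include <assert.h>

/* Assumption: each line is exactly LINE_LEN bases plus one '\n' byte. */
#define LINE_LEN 30

/* Hypothetical helper: convert a 1-based base position (newlines not
   counted) into a 0-based byte offset in the file. */
long base_to_offset(long pos)
{
    assert(pos >= 1);
    long zero_based = pos - 1;
    /* One '\n' byte precedes this base for every full line before it. */
    return zero_based + zero_based / LINE_LEN;
}
```

Under these assumptions, base 10 sits at offset 9 and base 45 at offset 45, because one newline byte lies between them. On platforms that translate line endings, the file would probably need to be opened in binary mode ("rb") for such offsets to be meaningful.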
I tried using the fread function as seen below for a test file with the above lines of letters. The parameters were start=1, end=90 and the resulting file looks like this:
TGTTCCAGGCTGTCAGATGCTAACCTGGGG
TCACTGGGGGTGTGCGTGCTGCTCCAGCCT
GTTCCAGGATATCAGATGCTCACCTGGG™eRV
Each run gives different random characters at the end.
The code:
FILE* fp;
fp=fopen(filename, "r");
if (fp==NULL) puts("Failed to open file");
int start=1, end=90;
char string[end-start+2]; //characters from start to end = end-start+1
fseek(fp, start-1, SEEK_SET);
fread(string, 1, end-start+1, fp);
FILE* tp;
tp=fopen("exon", "w");
if (tp==NULL) puts("Failed to make tmp file");
fprintf(tp, "%s\n", string);
fclose(tp);
I couldn't work out how fread handles \n characters, so I replaced the fread call with the following loop:
int i=0;
char ch;
while (!feof(fp))
{
    ch=fgetc(fp);
    if (ch != '\n')
    {
        string[i]=ch;
        i++;
        if (i==end-start) break;
    }
}
string[end-start+1]='\0';
It created the following file (without any line breaks, which I don't mind):
TGTTCCAGGCTGTCAGATGCTAACCTGGGGTCACTGGGGGTGTGCGTGCTGCTCCAGCCTGTTCCAGGATATCAGATGCTCACCTGGGGô
Again, each run gives a different random character in place of the final 'G'.
What am I doing wrong? Is there a way to get it done with fread or some other function?
Thank you in advance.
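For reference, one possible way to approach the extraction is sketched below. It is not a definitive implementation: it assumes positions are 1-based and inclusive, that only '\n' (and possibly '\r') separate the bases, and it simply scans from the beginning of the file rather than seeking. `extract_bases` is a hypothetical helper name. The points it illustrates are storing the result of fgetc in an int so EOF can be detected, counting exactly end-start+1 bases, and writing the terminating '\0' right after the last stored character.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: copy bases start..end (1-based, inclusive, counted
   without line breaks) from fp into a newly allocated, NUL-terminated
   string. Returns NULL if allocation fails; the caller frees the result. */
char *extract_bases(FILE *fp, long start, long end)
{
    assert(fp != NULL && start >= 1 && end >= start);
    size_t want = (size_t)(end - start + 1);
    char *out = malloc(want + 1);      /* +1 for the terminating '\0' */
    if (out == NULL) return NULL;

    rewind(fp);                        /* simple approach: scan from the top */
    long seen = 0;                     /* bases read so far */
    size_t n = 0;                      /* bases stored so far */
    int ch;                            /* int, so EOF is distinguishable */
    while (n < want && (ch = fgetc(fp)) != EOF) {
        if (ch == '\n' || ch == '\r') continue;   /* skip line breaks */
        seen++;
        if (seen >= start) out[n++] = (char)ch;
    }
    out[n] = '\0';                     /* terminate after the last stored base */
    return out;
}
```

A call such as extract_bases(fp, 10, 45) would then return a 36-character string that can be written with fprintf(tp, "%s\n", ...). For very large files, seeking to the right line first instead of scanning from the top would likely be faster.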