
I'm writing code that reads huge text files containing DNA bases and I need to be able to extract specific parts. The file looks like this:

TGTTCCAGGCTGTCAGATGCTAACCTGGGG
TCACTGGGGGTGTGCGTGCTGCTCCAGCCT
GTTCCAGGATATCAGATGCTCACCTGGGGG

...

Every line is 30 characters.

I have a separate file indicating these parts, meaning I have a start value and an end value. So for each start and end value, I need to extract the corresponding string in the file. For example, if I have start=10, end=45, I need to store the string which starts at the 10th character of the first line (C) and ends at the 15th character of the 2nd line (C) in a separate temporary file.

I tried using the fread function as seen below for a test file with the above lines of letters. The parameters were start=1, end=90 and the resulting file looks like this:

TGTTCCAGGCTGTCAGATGCTAACCTGGGG
TCACTGGGGGTGTGCGTGCTGCTCCAGCCT
GTTCCAGGATATCAGATGCTCACCTGGG™eRV

Each run will give random characters at the end.

The code:


FILE* fp;
fp=fopen(filename, "r");
if (fp==NULL) puts("Failed to open file");

int start=1, end=90;
char string[end-start+2]; //characters from start to end = end-start+1

fseek(fp, start-1, SEEK_SET);

fread(exon,1, end-start+1, fp);

FILE* tp;
tp=fopen("exon", "w");
if (tp==NULL) puts("Failed to make tmp file");

fprintf(tp, "%s\n", string);
fclose(tp);

I couldn't understand how fread handles \n characters so I tried replacing it with the following:

int i=0;
char ch;
while (!feof(fp))
{
            ch=fgetc(fp);

            if (ch != '\n') 
            {
                string[i]=ch;
                i++;
                if (i==end-start) break;
            }

}
string[end-start+1]='\0';

It created the following file: TGTTCCAGGCTGTCAGATGCTAACCTGGGGTCACTGGGGGTGTGCGTGCTGCTCCAGCCTGTTCCAGGATATCAGATGCTCACCTGGGGô

(without any line breaks, which I don't mind). Again with each run, I get a different random character instead of 'G'.

What am I doing wrong? Is there a way to get it done with fread or some other function?

Thank you in advance.

alinsoar
Kostis L
  • 2
    You have to take into account 31 characters per line (30 letters followed by `\n`), or possibly even 32 characters per line (30 letters followed by `\r\n`). Which means that you might want to check the format of your input file to begin with. And regardless of that, it's probably best to use `fseek` then `fread`. – goodvibration Jun 19 '19 at 12:55
  • 3
    FWIW, `fread` doesn't care about EOL characters at all. – Sean Bright Jun 19 '19 at 12:55
  • 2
    [Why is while (!feof(fp)) always wrong?](https://stackoverflow.com/questions/5431941/why-is-while-feoffile-always-wrong) `fread` doesn't handle newline characters specially; a newline is just another character. Also, it returns the number of characters read, and the resulting data is not null-terminated. – KamilCuk Jun 19 '19 at 12:57
  • 2
    I think there are two problems here: (1) You aren't taking into account that each line ends with a newline, which is a character. So to read 2 lines, you need to read 30 + 1 + 30 characters = 61 characters, not 60. You probably also want to strip out the newlines, and add your own back to it after every 30 characters. And (2) you aren't adding a null character to the end of your buffer, so when you try to print it as a string, it's going right past the end until it happens to encounter a random zero byte in memory. – Tom Karzes Jun 19 '19 at 12:57
  • In your own loop using `fgetc` you add the null at the end of the string, but I think your indexes are off -- you should add it to the position of `i` when you `break`. – vlumi Jun 19 '19 at 12:59
  • If you're seeking in a file then you must open the file in *binary* mode. `fseek` works in text files only to seek to an offset that was returned by `ftell`. – Antti Haapala -- Слава Україні Jun 19 '19 at 13:00
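The points raised in the comments above about the `fgetc` loop (testing `feof` before reading, storing the result in a `char`, and terminating at a fixed index rather than at `i`) can be sketched as follows. This is a minimal illustration, not the asker's final code: the helper name `read_bases` and its parameters are hypothetical, and it assumes the stream is already positioned at the first base to read.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: copy up to `count` bases from the current position of
 * `fp` into `buf`, skipping '\n' and '\r'. `buf` must hold count + 1 bytes.
 * Returns the number of bases actually stored. */
size_t read_bases(FILE *fp, char *buf, size_t count)
{
    assert(fp != NULL && buf != NULL);

    size_t i = 0;
    int ch;  /* int, not char, so that EOF can be told apart from real data */

    /* Test the result of fgetc itself instead of calling feof beforehand. */
    while (i < count && (ch = fgetc(fp)) != EOF) {
        if (ch == '\n' || ch == '\r')
            continue;            /* line terminators are not bases */
        buf[i++] = (char)ch;
    }

    buf[i] = '\0';               /* terminate at i, the number of bases stored */
    return i;
}
```

Because the terminator is written at `i`, the string is valid even when the file ends before `count` bases have been read.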

1 Answer


I have modified your code and added comments to it for explanation.

Please go through it. You neglected error checking, and the code has a few undefined variables.

I return from the `if` block on failure; `goto` would be more appropriate.

Please refer to this comment for whether to add 1 char or 2 chars to start and end.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main()
{
        FILE* fp;
        // fp = fopen(filename, "r");
        // filename is not declared in the question, so a hard-coded name is used here.
        // "rb" (binary mode) keeps fseek offsets as plain byte counts.
        fp = fopen("dna.txt", "rb");
        // Always check whether the file was opened
        if (fp == NULL) {
                puts("Failed to open file");
                return -1; 
        }

        // start and end are 1-based; start must not be 0
        int start = 1, end = 90; 

        // Adjust start and end for the line terminators ('\n' or '\r\n') that
        // precede them in the file. The line length of 30 is hard-coded.
        int m = 1; // bytes per line terminator: 1 for \n, 2 for \r\n
        --start; --end; // make the indices 0-based
        if (start != 0)
                start = start + (start / 30) * m;   // stays 0 here
        if (end != 0)
                end = end + (end / 30) * m;         // 89 + 2 = 91 here

        // lets declare the chars to read
        int char_to_read = end - start + 1;

        // need only 1 extra char to append null char
        // If start and end is going to change, then i would suggest using malloc instead of static buffer
        // because compiler cannot predict the memory to allocate to the buffer if it is dependent on external factor
        // char string[char_to_read + 1]; //characters from start to end = end-start+1

        char *string = malloc(char_to_read + 1); 
        if (string == NULL) {
                printf("malloc failed\n");
                fclose(fp);
                return -2;
        }

        // zero the buffer
        memset(string, 0, char_to_read + 1); 

        int rc = fseek(fp, start, SEEK_SET);
        if (rc != 0) {   // fseek returns nonzero on failure
                printf("fseek failed\n");
                free(string);
                fclose(fp);
                return -1;
        }

        // exon is not defined in the question; the data is read into string.
        // fread returns a size_t count and never -1, so use ferror to detect errors.
        size_t bytes_read = fread(string, 1, char_to_read, fp);

        if (ferror(fp)) {
                printf("fread failed\n");
                free(string);
                fclose(fp);
                return -1; 
        }

        // Now append the null char to the end
        string[bytes_read] = 0;
        printf("%s\n", string);
        fclose(fp);

        // free the memory once you are done with it
        if (string)
                free(string);


// To write the result to a file instead, do this before free(string):
//      FILE* tp;
//      tp=fopen("exon", "w");
//      if (tp==NULL) puts("Failed to make tmp file");

//      fprintf(tp, "%s\n", string);
//      fclose(tp);
}
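As a variation on the code above, here is a minimal sketch that also removes the line terminators after reading, so the result contains only bases. It is an illustration under stated assumptions, not a drop-in replacement: the function name `extract_range` and its parameters are hypothetical, every line is assumed to hold exactly `line_len` bases followed by `eol_len` terminator bytes, and the stream is assumed to be opened in binary mode (`"rb"`).

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: return bases start..end (1-based, inclusive) as a
 * malloc'ed, NUL-terminated string without line breaks.
 * Returns NULL on failure; the caller must free the result. */
char *extract_range(FILE *fp, long start, long end, long line_len, long eol_len)
{
    assert(fp != NULL && start >= 1 && end >= start);
    assert(line_len > 0 && eol_len >= 0);

    /* Byte offsets of the first and last requested base in the file. */
    long first = (start - 1) + ((start - 1) / line_len) * eol_len;
    long last  = (end - 1) + ((end - 1) / line_len) * eol_len;
    size_t want = (size_t)(last - first + 1);

    char *buf = malloc(want + 1);
    if (buf == NULL)
        return NULL;

    if (fseek(fp, first, SEEK_SET) != 0) {
        free(buf);
        return NULL;
    }

    size_t got = fread(buf, 1, want, fp);  /* may be short at end of file */
    if (ferror(fp)) {
        free(buf);
        return NULL;
    }

    /* Compact in place, dropping the line terminators that were read. */
    size_t out = 0;
    for (size_t i = 0; i < got; i++) {
        if (buf[i] != '\n' && buf[i] != '\r')
            buf[out++] = buf[i];
    }
    buf[out] = '\0';
    return buf;
}
```

With the three sample lines from the question and `\n` line endings, `extract_range(fp, 10, 45, 30, 1)` reads 37 bytes and returns a 36-character string that starts and ends with `C`.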
Shubham
  • 1
    Thank you for your detailed answer! However, there's still one thing that still confuses me. Assume we have 90 characters in a single line and start=1(first character), end=90(last one). Then the characters are not end-start but end-start+1. So, if that's true, it should be char_to_read=end-start +1. What am I missing? Btw, sry for some undefined variables, they're either missing due to the code being only a part of a function, or because I forgot to change their names when I copy-pasted them (e.g. 'exon' is actually 'string'). – Kostis L Jun 21 '19 at 10:13
  • Yes, `char_to_read` should be `end - start + 1`. I forgot that in your case indexing starts at 1. I'll change the code. – Shubham Jun 21 '19 at 10:46
  • There was an issue with the indexing of `start` and `end`; I fixed it by subtracting 1 from each. – Shubham Jun 21 '19 at 10:52
  • Instead of adjusting the values of `start` and `end`, you may do `int char_to_read = end - start; // it's a difference -> don't care if it's 0 or 1 indexed`, then add to that the number of newline chars (`char_to_read / 30`) + 1. After the read, the `\n`s should be removed (or loop over the string, writing each line to the file without the newline char). – Mance Rayder Jun 22 '19 at 22:45
  • 1
    @Shubham the edited `start` will now be 0, so shouldn't it be `fseek(fp, start, SEEK_SET)` in this example (start=1, end=90)? Also, my start/end values are generally not 0, so I'm guessing that `start+=(start/30)*m` (same with end) & `char_to_read=end-start+1` will work fine, right? – Kostis L Jun 24 '19 at 09:51
  • 1
    Yes, you are right about fseek; it should be `fseek(fp, start, SEEK_SET)`. If `start` and `end` are not zero then it will work. I updated the answer, thanks for pointing it out. – Shubham Jun 24 '19 at 09:57
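The index arithmetic discussed in the comments above can be checked with a small helper that maps a 1-based base index to a 0-based byte offset. This is only a sketch: the name `base_to_offset` is hypothetical, and it assumes every line holds exactly `line_len` bases followed by `eol_len` terminator bytes (1 for `\n`, 2 for `\r\n`).

```c
#include <assert.h>

/* Hypothetical helper: byte offset in the file of the base with 1-based
 * index `index1`, for fixed-length lines. */
long base_to_offset(long index1, long line_len, long eol_len)
{
    assert(index1 >= 1 && line_len > 0 && eol_len >= 0);

    long zero_based = index1 - 1;
    /* Each complete line before this base adds eol_len terminator bytes. */
    return zero_based + (zero_based / line_len) * eol_len;
}
```

For 30-base lines ending in `\n`, base 1 maps to offset 0 and base 90 maps to offset 91, so reading bases 1 to 90 means reading 91 - 0 + 1 = 92 bytes, two of which are newlines.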