
I am writing a program that parses a file into another format. The input file is 20 GB, so I turned to C to parse it. However, once the output file reaches 4.3 GB (around the 41-second mark), the program crashes with a segmentation fault.

Tailing the output file shows that the output stops in the middle of a write.

The input file is available at ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/ as the gzip-compressed file idmapping.dat.gz.

I expect the program to parse the whole file rather than stop with a segmentation fault.

int main()
{
    char line[256];
    char placeholdertoken[256];
    char placeholderline[256];
    char *token1, *token2, *token3;
    char *chdup;
    char *tab, *newline, *semicolom, *empty;
    FILE *fp;
    FILE *fs;
    fp = fopen("idmapping.dat", "r");
    fs = fopen("parsedidmapping.dat", "w");

    if( fp == NULL )
    {
        perror("Error while opening the file.\n");
        exit(EXIT_FAILURE);
    }

    strcpy(tab,"\t");
    strcpy(newline,"\n");
    strcpy(semicolom,";");
    strcpy(empty,"");
    strcpy(placeholdertoken,"");

    while (fgets(line, sizeof(line), fp) != NULL)
    {

    token1 = strtok(line, "\t");
    token2 = strtok(NULL, "\t");
    token3 = strtok(NULL, "\n");

    if (strcmp(token1, placeholdertoken) == 0) {
        strcat(placeholderline, token2);
        strcat(placeholderline, semicolom);
        strcat(placeholderline, token3);
        strcat(placeholderline, tab);
    }
    else {
        strcat(placeholderline, newline);
        strcpy(placeholdertoken,token1);
        fputs(placeholderline, fs);
        strcpy(placeholderline, empty);
        strcat(placeholderline, token1);
        strcat(placeholderline, tab);
        strcat(placeholderline, token2);
        strcat(placeholderline, semicolom);
        strcat(placeholderline, token3);
        strcat(placeholderline, tab);
        }
    }

    fclose(fs);
    fclose(fp);
    return 0;
}
Sinshz
  • Post your code here. – ameyCU Nov 13 '15 at 13:47
  • You should post your source code here instead of merely linking to it, for posterity and convenience. – Magisch Nov 13 '15 at 13:51
  • Your `placeholderline` variable is never initialized, yet you're `strcat()`-ing and assigning characters to it in various places. I'm surprised this code ever works to begin with. – Paul Roub Nov 13 '15 at 13:52
  • While it may just be a coincidence, 4.3 billion is the overflow point of a uint32_t. – Russ Schultz Nov 13 '15 at 13:52
  • @RussSchultz I have indeed changed the compiler settings to build for 64-bit, and the output now reaches 4.9 GB; however, it still gives a segmentation fault. – Sinshz Nov 16 '15 at 12:20
  • Why are you using strtok on NULL? Should it be line perhaps? – technosaurus Nov 17 '15 at 11:13
  • @technosaurus I found that in another stackoverflow answer, it splits the line and somehow keeps doing that using that code. – Sinshz Nov 19 '15 at 10:49
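
On the `strtok(NULL, ...)` question above: passing `NULL` as the first argument tells `strtok` to continue scanning the string given in the previous call, which is why the three calls in the question walk through one line. A minimal sketch of that pattern follows, assuming each record has three tab-separated fields; the helper name `split_record` and the sample data in the note below are hypothetical, not from the original program.

```c
#include <assert.h>
#include <string.h>

/* Splits one "id<TAB>type<TAB>value<NEWLINE>" record in place.
 * Returns 1 when all three fields were found, 0 otherwise.
 * Note: strtok() modifies `line` and keeps hidden static state,
 * so it must not be interleaved with another strtok() parse. */
int split_record(char *line, char **id, char **type, char **value)
{
    assert(line != NULL);
    *id    = strtok(line, "\t");   /* first call: start scanning `line` */
    *type  = strtok(NULL, "\t");   /* NULL: continue where the last call stopped */
    *value = strtok(NULL, "\n");   /* rest of the line, without the newline */
    return *id != NULL && *type != NULL && *value != NULL;
}
```

Checking the return value matters here: `strtok` returns `NULL` when a field is missing, and passing that `NULL` to `strcmp` or `strcat`, as the question's loop would, is undefined behavior. For a made-up line such as "P12345", "GeneID", "67890" separated by tabs, the three pointers end up pointing at those three fields inside the original buffer.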

2 Answers

Your placeholdertoken[] is uninitialized. Your placeholderline is not allocated any memory.

I am surprised it is running for 41 seconds.

Haris
  • I have tried to initialize placeholdertoken in my code, however it keeps saying that i am trying to cast an integer to a char even when i use something like "Test", I have edited my code to allow placeholderline to have memory now. – Sinshz Nov 16 '15 at 08:24
  • @Sinshz, Please post the new code separately after the old one, mentioning that it's an edit. I cannot make out what changes you made. – Haris Nov 16 '15 at 08:47
  • Edited with my changes, it now gives out an error: incompatible types when assigning to type ‘char[256]’ from type ‘char *’ placeholdertoken = "0"; I know this has something to do with how I declared the variable but I am learning C right now so I can't figure out what I keep doing wrong. – Sinshz Nov 16 '15 at 12:19
  • @Sinshz, What are you trying to do in this `*placeholderline = *empty;`? – Haris Nov 16 '15 at 12:30
  • the line needs to be emptied so the next set of data can be entered, this is because it tries to group data in the file based on the first ID. – Sinshz Nov 17 '15 at 09:35
  • @Sinshz, You cannot do like that, use `strcpy()` instead. – Haris Nov 17 '15 at 09:44
  • I have changed that part. I now only get a warning that I am casting an integer at placeholdertoken = ""; however, the program gives a segmentation fault as soon as it runs. – Sinshz Nov 17 '15 at 10:43
  • @Sinshz, Where did you do `placeholdertoken = "";`? It's not there in the code. – Haris Nov 17 '15 at 11:13
  • it is directly below the check if the file is empty – Sinshz Nov 18 '15 at 09:40
  • @Sinshz, ya you cannot do like that. Check this out http://stackoverflow.com/q/31808812/1795279 – Haris Nov 18 '15 at 11:03
  • @Sinshz, for that also use `strcpy()` – Haris Nov 18 '15 at 11:03
  • I have just updated my code, however it still gives a segmentation fault and I have tried to find it myself but to no avail. – Sinshz Nov 19 '15 at 10:50
  • I think you should ask another question with the newly updated code. Changing the code like that makes all the answers useless. Please do not do that. – Haris Nov 19 '15 at 10:52
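
To illustrate the point from the comments above: a `char` array cannot be assigned with `=` after its declaration, which is what the "incompatible types" error is about. It can be given an initializer when declared (`char token[256] = "";`), and afterwards it has to be filled or emptied by copying. A small sketch, where the helper names `set_buffer` and `clear_buffer` are made up for illustration:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Empties a string buffer; equivalent to strcpy(buf, ""). */
void clear_buffer(char *buf)
{
    assert(buf != NULL);
    buf[0] = '\0';
}

/* Copies `src` into `buf`, which holds `cap` bytes.
 * Returns 0 on success, -1 if `src` did not fit (the copy is then
 * truncated but still NUL-terminated). */
int set_buffer(char *buf, size_t cap, const char *src)
{
    assert(buf != NULL && src != NULL && cap > 0);
    int needed = snprintf(buf, cap, "%s", src);
    return (needed >= 0 && (size_t)needed < cap) ? 0 : -1;
}
```

Unlike a bare `strcpy`, the size-checked copy reports when the source would not fit instead of writing past the end of the array.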

You write to placeholderline which is an uninitialized pointer. This is undefined behavior.

You also read placeholdertoken before writing to it.
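
For reference, here is one possible way to restructure the grouping loop so that nothing is read before it is written and no fixed-size line buffer is appended to without a bound. This is only a sketch under assumptions, not the asker's final code: it assumes three tab-separated fields per line, that lines fit in `LINE_MAX_LEN` bytes, and that lines with the same ID are adjacent. The function name `group_records` and both size constants are invented for the example. It writes each piece straight to the output stream instead of building the whole group in memory first.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define LINE_MAX_LEN 4096   /* assumed upper bound on one input line */
#define ID_MAX_LEN   256    /* assumed upper bound on the first field */

/* Groups consecutive lines that share the same first field.
 * Input line:  id<TAB>type<TAB>value
 * Output line: id<TAB>type;value<TAB>type;value<TAB>...
 * Returns the number of groups written, or -1 on a malformed line. */
long group_records(FILE *in, FILE *out)
{
    assert(in != NULL && out != NULL);
    char line[LINE_MAX_LEN];
    char current_id[ID_MAX_LEN] = "";   /* initialized before first use */
    long groups = 0;

    while (fgets(line, sizeof line, in) != NULL) {
        char *id    = strtok(line, "\t");
        char *type  = strtok(NULL, "\t");
        char *value = strtok(NULL, "\n");

        /* strtok returns NULL for a missing field; never pass that on. */
        if (id == NULL || type == NULL || value == NULL)
            return -1;
        if (strlen(id) >= sizeof current_id)
            return -1;

        if (groups == 0 || strcmp(id, current_id) != 0) {
            if (groups > 0)
                fputc('\n', out);       /* close the previous group */
            strcpy(current_id, id);     /* safe: length checked above */
            fprintf(out, "%s\t", id);
            groups++;
        }
        fprintf(out, "%s;%s\t", type, value);
    }
    if (groups > 0)
        fputc('\n', out);               /* close the last group */
    return groups;
}
```

Because the output is streamed, an ID with many mappings cannot overflow a 256-byte accumulator, which is one plausible way the original loop could run for a while and then crash. A caller would open `idmapping.dat` and `parsedidmapping.dat` with `fopen`, check both results for `NULL`, and pass them in.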

unwind