0

I need remove punctuation from a given string or a word. Here's my code:

void remove_punc(char* *str)
{
    char* ps = *str;
    char* nstr;
    // should be nstr = malloc(sizeof(char) * (1 + strlen(*str)))
    nstr = (char *)malloc(sizeof(char) * strlen(*str));
    if (nstr == NULL) {
        perror("Memory Error in remove_punc function");
        exit(1);
    }
    // should be memset(nstr, 0, sizeof(char) * (1 + strlen(*str)))
    memset(nstr, 0, sizeof(char) * strlen(*str));
    while(*ps) {
        if(! ispunct(*ps)) {
            strncat(nstr, ps, 1);
        }
        ++ps;
    }
    *str = strdup(nstr);
    free(nstr);
}

If my main function is the simple one:

int main(void) {
    char* str = "Hello, World!:)";
    remove_punc(&str);
    printf("%s\n", str);
    return 0;
}

It works! The output is Hello World.

Now I want to read in a big file and remove punctuation from the file, then output to another file. Here's another main function:

int main(void) {
    FILE* fp = fopen("book.txt", "r");
    FILE* fout = fopen("newbook.txt", "w");
    char* str = (char *)malloc(sizeof(char) * 1024);
    if (str == NULL) {
        perror("Error -- allocating memory");
        exit(1);
    }
    memset(str, 0, sizeof(char) * 1024);
    while(1) {
        if (fscanf(fp, "%s", str) != 1)
            break;
        remove_punc(&str);
        fprintf(fout, "%s ", str);
    }
    return 0;
}

When I rerun the program in Visual C++, it reports a Debug Error! DAMAGE: after Normal Block(#54)0x00550B08, and the program is aborted.

So, I have to debug the code. Everything works until the statement free(nstr) being executed. I get confused. Anyone can help me?

wintr
  • 53
  • 12
  • `strlen` doesn't include the terminator, so sending a string to this function with *no* punctuation will guarantee an overwrite-by-one error, invoke undefined behavior, and if you're *lucky*, crash your process. – WhozCraig Oct 02 '14 at 05:03
  • it is not so effective to do a copy of your original text and write it to the new file, better would be just to read from the original text and then write char by char to the new file but skip any punctuation. That way you save the memory allocs. – AndersK Oct 02 '14 at 05:49
  • I need read a word, and remove punctuation from the word, then count the word in the book. – wintr Oct 02 '14 at 06:08

3 Answers3

2

You forgot to malloc space for the null terminator. Change

nstr = (char *)malloc(sizeof(char) * strlen(*str));

to

nstr = malloc( strlen(*str) + 1 );

Note that casting malloc is a bad idea, and if you are going to malloc and then memset to zero, you could use calloc instead which does just that.


There is another bug later in your program. The remove_punc function changes str to point to a freshly-allocated buffer that is just big enough for the string with no punctuation. However you then loop up to fscanf(fp, "%s", str). This is no longer reading into a 1024-byte buffer, it is reading into just the buffer size of the previous punctuation-free string.

So unless your file contains lines all in descending order of length (after punctuation removal), you will cause a buffer overflow here. You'll need to rethink your design of this loop. For example perhaps you could have remove_punc leave the input unchanged, and return a pointer to the freshly-allocated string, which you would free after printing.

If you go with this solution, then use %1023s to avoid a buffer overflow with fscanf (unfortunately there's no simple way to take a variable here instead of hardcoding the length). Using a scanf function with a bare "%s" is just as dangerous as gets.

Community
  • 1
  • 1
M.M
  • 138,810
  • 21
  • 208
  • 365
  • Thanks! It solved my problem. But it still doesn't run correctly for a large file. The worst thing is, no tips are given by the IDE. – wintr Oct 02 '14 at 05:12
  • @wintr now learn to use your debugger to find out where it is going wrong; and/or add in extra print statements so you can see what is going on. – M.M Oct 02 '14 at 05:21
1

The answer by @MatMcNabb explains the causes of your problems. I'm going to suggest couple of ways you can simplify your code, and make it less susceptible to memory problems.

  1. If performance is not an issue, read the file character by character and discard the puncuation characters.

    int main(void)
    {
       FILE* fp = fopen("book.txt", "r");
       FILE* fout = fopen("newbook.txt", "w");
       char c;
    
       while ( (c = fgetc(fp)) != EOF )
       {
          if ( !ispunct(c) )
          {
             fputc(c, fout);
          }
       }
    
       fclose(fout);
       fclose(fp);
    
       return 0;
    }
    
  2. Minimize the number of calls to malloc and free by passing in the input string as well as the output string to remove_punc.

    void remove_punc(char* inStr, char* outStr)
    {
       char* ps = inStr;
       int index = 0;
       while(*ps) 
       {
          if(! ispunct(*ps))
          {
             outStr[index++] = *ps;
          }
          ++ps;
       }
       outStr[index] = '\0';
    }
    

    and change the way you use remove_punc in main.

    int main(void)
    {
       FILE* fp = fopen("book.txt", "r");
       FILE* fout = fopen("newbook.txt", "w");
       char inStr[1024];
       char outStr[1024];
    
       while (fgets(inStr, 1024, fp) != NULL )
       {
          remove_punc(inStr, outStr);
          fprintf(fout, "%s", outStr);
       }
    
       fclose(fout);
       fclose(fp);
    
       return 0;
    }
    
R Sahu
  • 204,454
  • 14
  • 159
  • 270
0

In your main you have the following

char* str = (char *)malloc(sizeof(char) * 1024);
...
        remove_punc(&str);
...

Your remove_punc() function takes the address of str but when you do this in your remove_punc function

...
*str = strdup(nstr);
...

you are not copying the new string to the previously allocated buffer, you are reassigning str to point to the new line sized buffer! This means that when you read lines from the file and the next line to be read is longer than the previous line you will run into trouble.

You should leave the original buffer alone and instead e.g. return the new allocate buffer containing the new string e.g. return nstr and then free that when done with it or better yet just copy the original file byte by byte to the new file and exclude any punctuation. That would be far more effective

AndersK
  • 35,813
  • 6
  • 60
  • 86