
I have strings that have HTML tags in them (e.g.: "<p>sample_text</p>"). I would like to remove these tags from the strings as seen in the pseudo-code below:

string(string input_string)
{
    int i = 0
    bool is_deleting = False
    
    while(i < length(input_string))
    {
         if(input_string[i] == "<")
         {
             is_deleting = True
         }
         
         if(is_deleting == True)
         {
             if(input_string[i] == ">")
             {
                 is_deleting = False
             }
             input_string[i] = ""
         }
         i += 1
     }
return input_string
}

How could I make this work?

cigien
Mate96
  • You don't want to reinvent the wheel to parse HTML. See [Parse html using C](https://stackoverflow.com/questions/1527883/parse-html-using-c) and particularly gumbo-parser. While you can use a pair of pointers or tools like `strstr()` for a simple case, if you have anything more than a simple case, use a validated parser. – David C. Rankin Oct 08 '20 at 21:43
  • Here is a version of yours that has the absolute minimum changes needed to make it work: https://onlinegdb.com/B14QjWp8P The reason we don't set input_string[i] to nothing when deleting is that it would still leave an empty spot in the string. Instead, we need to shift all the characters to its right over to get rid of the empty spot. But we don't want to shift the remainder of the string every time a character is deleted, so instead we just copy each character to where it belongs unless we are deleting. – Jerry Jeremiah Oct 08 '20 at 21:57
  • Possible duplicate: https://stackoverflow.com/questions/9444200/c-strip-html-between – Jerry Jeremiah Oct 08 '20 at 22:06
  • @JerryJeremiah - I'd post your minimal changes as another answer if you like. – David C. Rankin Oct 08 '20 at 22:34

2 Answers


You are thinking in the right direction; you have just confused the logic for deleting. While is_deleting is true you are inside a tag, so you only want to copy characters when you are not deleting.

Rather than naming the condition is_deleting, why not track whether you are intag? When iterating over characters, being either in a tag (ignoring characters) or not in a tag (copying characters) seems a bit more descriptive.

Regardless, there are three cases for the current character: (1) it is '<', a tag opening, so you set the intag flag true; (2) intag is true, so you skip the character, and if it is '>' (the tag close) you set intag false; or (3) intag is false, so you copy the character. That maps directly onto an if / else if / else chain.
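
The three cases can be sketched as a small two-state scanner. This is a variation for illustration, not the code from this answer: the name strip_tags_copy, the separate output buffer, and the dstsize parameter are all assumptions, chosen so that the input may also be a string literal, which must not be modified in place.

```c
#include <stddef.h>

/* Hypothetical helper (not from the answer): copies src to dst, skipping
 * everything from '<' through '>'. Writes at most dstsize - 1 characters
 * plus the terminating nul and returns the number of characters written. */
size_t strip_tags_copy (const char *src, char *dst, size_t dstsize)
{
    enum { TEXT, TAG } state = TEXT;    /* the two states of the scanner */
    size_t n = 0;                       /* characters written so far */

    if (dstsize == 0)                   /* no room even for the nul */
        return 0;

    for (; *src; src++) {               /* stop at the nul terminator */
        if (*src == '<')                /* (1) tag opening */
            state = TAG;
        else if (state == TAG) {        /* (2) inside a tag: skip */
            if (*src == '>')            /* tag close */
                state = TEXT;
        }
        else if (n + 1 < dstsize)       /* (3) plain text and room left */
            dst[n++] = *src;
    }
    dst[n] = '\0';                      /* nul-terminate the output */

    return n;
}
```

With a 32-byte buffer out, strip_tags_copy ("<p>sample_text</p>", out, sizeof out) leaves sample_text in out and returns 11. Text that does not fit in dst is silently dropped, which is a design choice of this sketch rather than a requirement.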

When looping over the characters in any string, there is no need to take the strlen(). The nul-terminating character marks the end of the string for you.
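
As a minimal illustration of relying on the terminator, here is a hypothetical helper, count_char, that is not part of the answer. The loop condition s[i] becomes false at the nul character, so no length is computed beforehand.

```c
#include <stddef.h>

/* Hypothetical helper: counts how often c occurs in s.
 * The loop ends when s[i] is the nul terminator, so strlen() is not needed. */
size_t count_char (const char *s, char c)
{
    size_t n = 0;

    for (size_t i = 0; s[i]; i++)       /* s[i] == 0 ends the loop */
        if (s[i] == c)
            n++;

    return n;
}
```

For example, count_char ("<p>sample_text</p>", '<') returns 2.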

If you put that together, you could do:

#include <stdio.h>

char *rmtags (char *s)
{
    int intag = 0,                      /* flag in-tag 0/1 (false/true) */
        write = 0;                      /* write index */
    
    for (int i = 0; s[i]; i++) {        /* loop over each char in s */
        if (s[i] == '<')                /* tag opening? */
            intag = 1;                  /* set intag flag true */
        else if (intag) {               /* if inside a tag */
            if (s[i] == '>')            /* tag close */
                intag = 0;              /* set intag false */
        }
        else                            /* not opening & not in tag */
            s[write++] = s[i];          /* copy to write index, increment */
    }
    s[write] = 0;                       /* nul-terminate s */
    
    return s;                           /* convenience return of s */
}

int main (void) {
    
    char s[] = "<p>sample_text</p>";
    
    printf ("text: '%s'\n", rmtags (s));
}

(note: You don't want to reinvent the wheel to parse HTML. See Parse html using C (https://stackoverflow.com/questions/1527883/parse-html-using-c), and particularly gumbo-parser. This limited, simple example is trivial, but nested tags spanning multiple lines complicate the job quickly. Use a library that validates HTML.)
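
One concrete case the simple scanner mishandles is a '>' inside a quoted attribute value: for <a title="x>y">link</a>, rmtags would leave y">link. The sketch below, a hypothetical variant named rmtags_quoted that is not part of the answer, shows how much extra state even that single case needs. It assumes attribute values are quoted with " or ' and still ignores comments, scripts, and entities, so it does not replace a validated parser.

```c
#include <stddef.h>

/* Hypothetical variant of rmtags(): also remembers whether it is inside a
 * quoted attribute value, so a '>' between quotes does not end the tag.
 * Modifies s in place and returns it. */
char *rmtags_quoted (char *s)
{
    int intag = 0;                      /* inside <...> ? */
    char quote = 0;                     /* active quote char in a tag, or 0 */
    size_t write = 0;                   /* write index */

    for (size_t i = 0; s[i]; i++) {
        char c = s[i];

        if (intag) {
            if (quote) {                /* inside "..." or '...' */
                if (c == quote)         /* matching quote ends the value */
                    quote = 0;
            }
            else if (c == '"' || c == '\'')
                quote = c;              /* attribute value starts */
            else if (c == '>')
                intag = 0;              /* unquoted '>' closes the tag */
        }
        else if (c == '<')
            intag = 1;                  /* tag opening */
        else
            s[write++] = c;             /* plain text: keep it */
    }
    s[write] = '\0';                    /* nul-terminate the result */

    return s;
}
```

Applied to a writable array holding <a title="x>y">link</a>, this leaves link.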

Example Use/Output

$ ./bin/html_rmtags
text: 'sample_text'
David C. Rankin
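/* Remove everything from opentag through closetag (inclusive), in place. */
char *removetags(char *str, char opentag, char closetag)
{
    char *write = str, *read = str;     /* write trails read through str */
    int remove = 0;                     /* nonzero while inside a tag */

    while(*read)
    {
        if(*read == closetag && remove)
        {
            read++;                     /* skip the closing delimiter */
            remove = 0;
            continue;                   /* re-test *read: the terminator may
                                           directly follow the tag and must
                                           not be copied and stepped over */
        }
        if(*read == opentag || remove)
        {
            read++;                     /* skip the opener or a char in the tag */
            remove = 1;
        }
        else
        {
            *write++ = *read++;         /* keep a character outside any tag */
        }
    }
    *write = 0;                         /* nul-terminate the shortened string */
    return str;
}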
0___________