My program reads the tags in an HTML file and prints each tag name along with the number of times it occurs. For this problem, a tag name begins immediately after '<', consists of alphanumeric characters, and ends at either '>' or a space.

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <string.h>

#define MAX_TAG_LEN 10
#define MAX_TAGS 100

void htagsA3()
{
    char c;
    int within_tag = 0;
    char tagName[MAX_TAG_LEN];
    int tagNameLen = 0;

    char tags[MAX_TAGS][MAX_TAG_LEN];    //stores tag names
    int tagCounts[MAX_TAGS];              //stores count of each tag
    int numOfTags = 0;
    
    while((c = getchar()) != EOF)
    {
        if(c == '<')
        {
            within_tag = 1;
            tagNameLen = 0;
        }
        else if(c == '>' || c == ' ')
        {
            within_tag = 0;
            tagName[tagNameLen] = '\0';
    
            int i;
            for(i=0; i<numOfTags; i++)
            {
                if(strcmp(tags[i], tagName) == 0)
                {
                    tagCounts[i]++;
                    break;
                }
            }
    
            if(i == numOfTags)
            {
                strncpy(tags[numOfTags], tagName, MAX_TAG_LEN);
                tagCounts[numOfTags] = 1;
                numOfTags++;
            }
        }
        else if(within_tag)
        {
            while(c != ' ')
            {
                if(isalnum(c) && tagNameLen < MAX_TAG_LEN)
                {   
                    tagName[tagNameLen] = c;
                    tagNameLen++;
                }
            }
        }
    }
    
    printf("HTML Tags Found:\n");
    int i;
    for(i=0; i<numOfTags; i++)
    {
        printf("%s: %d\n", tags[i], tagCounts[i]);
    }
}

int main()
{
    htagsA3();
}

I want to keep adding characters to the tag name until a space is seen, so I used while(c != ' '). When I compile and run this, the command prompt hangs on a blank line. Without the while loop, the program runs and displays the right tag names, but the counts are wrong: the tag counter is also incremented at every space, whereas I only want to count how many times each tag appears. I am using input redirection to feed an HTML file to the program. Please help me find the errors.

Here is a sample output:

HTML Tags Found: 
body: 4
div: 2
p: 6
b: 4
span: 16

The correct count should actually be:

body: 1
div: 1
p: 2
b: 2
span: 2

Here is the content of the sample HTML file used as input:

<body lang=EN-CA link=blue vlink="#954F72">
<div class=WordSection1>
<p class=MsoNormal><b><span lang=EN-US style='font-size:14.0pt;font-family: "Times New Roman",serif'>CS 2263</span></b></p>
<p class=MsoNormal><b><span lang=EN-US style='font-size:14.0pt;font-family: "Times New Roman",serif'>Assignment 1</span></b></p>
Andreas Wenzel
  • `c` should be an `int` to properly detect `EOF`. Are you generating an `EOF` to end the loop? `ctrl+z` then enter on an empty line in Windows or `ctrl+d` on Mac/Linux. `while(c != ' ')` is an infinite loop since `c` does not change inside it. – Retired Ninja Mar 22 '23 at 21:12
  • @RetiredNinja: As stated in the question, OP is using input redirection so that the posted HTML file is on `stdin`. Therefore, the issue of having to press `CTRL-Z` or `CTRL-D` to generate `EOF` does not apply. – Andreas Wenzel Mar 22 '23 at 21:14
  • If the infinite loop is removed it seems to work. https://godbolt.org/z/Kns8YKh68 I missed the redirection in the long story. – Retired Ninja Mar 22 '23 at 21:18
  • Changing the declaration of `c` to `int` should fix the problem with the main loop. – Barmar Mar 22 '23 at 21:22
  • `while(c != ' ')` should probably be `if(c != ' ')` – Barmar Mar 22 '23 at 21:24
  • It is important for `c` to be an `int` and not a `char`, because the data type `char` is not guaranteed to be able to represent the value `EOF`, whereas an `int` is guaranteed to be able to represent that value. `EOF` is not a character code, but a special value outside the range of character codes. That is why `getchar` returns an `int`, not a `char`. – Andreas Wenzel Mar 22 '23 at 21:30
  • Please see the updated question, as a mistake was made in the code and query. – goldengomi Mar 22 '23 at 21:52

1 Answer

Your posted program has several problems:

  1. A char is not guaranteed to be able to represent the value EOF. Therefore, the line while((c = getchar()) != EOF) is not guaranteed to work. You should change the data type of c from char to int. See this question for further information: Why must the variable used to hold getchar's return value be declared as int?
  2. The line while(c != ' ') will effectively create an infinite loop, because the value of c will not change inside that loop.
  3. You are counting the tags multiple times. You count the tag for every ' ' character encountered in the tag, in addition to the '>' character. This is clearly visible when you run your program line by line in a debugger while monitoring the values of all variables.

Problem #1 can be solved by changing the line

char c;

to:

int c;
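
To see why the type matters, here is a minimal sketch. The two helper functions are hypothetical and not part of your program. They compare a byte value against EOF after storing it in a signed char and in an int. The sketch assumes a typical platform where EOF is -1 and char is 8 bits wide; if plain char is unsigned on your platform, the failure mode is different (the comparison with EOF is never true, so the loop never ends).

```c
#include <stdio.h>

/* Hypothetical helper: what the test `c != EOF` sees when the value
   returned by getchar() has first been stored in a signed char. */
int looks_like_eof_in_signed_char(int byte)
{
    signed char c = (signed char)byte;   /* 0xFF typically becomes -1 */
    return c == EOF;
}

/* The same test when the value is kept in an int, as getchar() returns it. */
int looks_like_eof_in_int(int byte)
{
    int c = byte;
    return c == EOF;
}
```

Under those assumptions, a valid input byte with the value 0xFF is mistaken for EOF in the first function but not in the second, so a loop using a signed char could stop reading in the middle of the file.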

Problem #2 can be solved by removing the line while(c != ' '), as the content of that loop should be executed exactly once (i.e. the loop content should not be inside a loop).

Problem #3 can be solved by only counting the tag when c == '>', and not when c == ' '.

Here is your fixed code:

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <string.h>

#define MAX_TAG_LEN 10
#define MAX_TAGS 100

void htagsA3()
{
    int c;
    int within_tag = 0;
    char tagName[MAX_TAG_LEN];
    int tagNameLen = 0;

    char tags[MAX_TAGS][MAX_TAG_LEN];    //stores tag names
    int tagCounts[MAX_TAGS];              //stores count of each tag
    int numOfTags = 0;

    while((c = getchar()) != EOF)
    {
        if(c == '<')
        {
            within_tag = 1;
            tagNameLen = 0;
        }
        else if (c == '>' || c == ' ')
        {
            within_tag = 0;
            tagName[tagNameLen] = '\0';

            if ( c == '>' )
            {
                int i;
                for(i=0; i<numOfTags; i++)
                {
                    if(strcmp(tags[i], tagName) == 0)
                    {
                        tagCounts[i]++;
                        break;
                    }
                }
    
                if(i == numOfTags)
                {
                    strncpy(tags[numOfTags], tagName, MAX_TAG_LEN);
                    tagCounts[numOfTags] = 1;
                    numOfTags++;
                }
            }
        }
        else if(within_tag)
        {
            if(isalnum(c) && tagNameLen < MAX_TAG_LEN)
            {   
                tagName[tagNameLen] = c;
                tagNameLen++;
            }
        }
    }
    
    printf("HTML Tags Found:\n");
    int i;
    for(i=0; i<numOfTags; i++)
    {
        printf("%s: %d\n", tags[i], tagCounts[i]);
    }
}

int main()
{
    htagsA3();
}

However, after fixing these problems, the program still does not give the desired output. It outputs the following:

HTML Tags Found:
body: 1
div: 1
p: 4
b: 4
span: 4

The reason for this is that your program counts the opening tags and the closing tags as duplicates of the same tag. This is because you are filtering the tag names through the function isalnum, so that the / in the closing tag names gets dropped, making the names of the closing tags identical to those of the opening tags. If you remove this filter, by changing the line

if(isalnum(c) && tagNameLen < MAX_TAG_LEN)

to

if(tagNameLen < MAX_TAG_LEN)

then your program has the following output:

HTML Tags Found:
body: 1
div: 1
p: 2
b: 2
span: 2
/span: 2
/b: 2
/p: 2

As you can see, the closing tags are now counted separately.
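
The effect of the isalnum filter can be reproduced in isolation. The following is only a sketch: keep_alnum is a hypothetical helper, not a function from your program, and it applies the same per-character filter to a whole string.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: copies only the alphanumeric characters of `raw`
   into `out` (capacity `cap`, including the terminator), mimicking the
   isalnum filter in the posted loop. */
void keep_alnum(const char *raw, char *out, size_t cap)
{
    size_t n = 0;

    if (cap == 0)
        return;

    for (; *raw != '\0' && n + 1 < cap; raw++)
    {
        if (isalnum((unsigned char)*raw))
            out[n++] = *raw;
    }
    out[n] = '\0';
}
```

With this filter, both "span" and "/span" come out as "span", which is why the opening and closing tags ended up sharing one counter.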

EDIT: In the comments section of this answer, you stated that you want to use the function isalnum to determine whether the tag name contains only alphanumeric characters and, if it does not, to ignore the tag. In that case, you must change the program to ignore the entire tag, instead of only the non-alphanumeric characters:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <stdbool.h>

#define MAX_TAG_LEN 10
#define MAX_TAGS 100

void htagsA3( void )
{
    bool within_tag = false;
    bool within_tag_name = false;
    bool ignore_tag;

    char tagName[MAX_TAG_LEN];
    int tagNameLen = 0;

    char tags[MAX_TAGS][MAX_TAG_LEN];    //stores tag names
    int tagCounts[MAX_TAGS];              //stores count of each tag
    int numOfTags = 0;

    int c;
    
    while ( (c=getchar()) != EOF )
    {
        switch ( c )
        {
            case '<':
            {
                //verify that there was no syntax error
                if ( within_tag )
                {
                    fprintf( stderr, "Error: Encountered '<' within another tag!\n" );
                    exit( EXIT_FAILURE );
                }

                //initialize variables for processing new tag
                within_tag = true;
                within_tag_name = true;
                ignore_tag = false;
                tagNameLen = 0;

                break;
            }
            case ' ':
            {
                //specify that we are no longer parsing the tag name
                within_tag_name = false;

                break;
            }
            case '>':
            {
                //verify that there was no syntax error
                if ( !within_tag )
                {
                    fprintf( stderr, "Error: Encountered '>' outside tag!\n" );
                    exit( EXIT_FAILURE );
                }

                //specify that we are no longer processing a tag
                within_tag = false;
                within_tag_name = false;

                if ( !ignore_tag )
                {
                    int i;

                    //add null terminating character to tag name
                    tagName[tagNameLen] = '\0';

                    //determine whether tag already exists in list
                    for( i = 0; i < numOfTags; i++ )
                    {
                        if( strcmp( tags[i], tagName ) == 0 )
                        {
                            tagCounts[i]++;
                            break;
                        }
                    }

                    //if tag did not already exist in list, add it to list
                    if ( i == numOfTags )
                    {
                        if ( numOfTags == MAX_TAGS )
                        {
                            fprintf( stderr, "Error: Too many tags!\n" );
                            exit( EXIT_FAILURE );
                        }

                        strncpy( tags[numOfTags], tagName, MAX_TAG_LEN );
                        tagCounts[numOfTags] = 1;
                        numOfTags++;
                    }
                }
                break;
            }
            default:
            {
                if ( within_tag_name )
                {
                    //add character to tagName
                    tagName[tagNameLen++] = c;

                    //verify that tag is not too long
                    if ( tagNameLen == MAX_TAG_LEN )
                    {
                        fprintf( stderr, "Error: Tag name too long!\n" );
                        exit( EXIT_FAILURE );
                    }

                    //mark tag as to be ignored, if appropriate
                    if ( !isalnum( (unsigned char)c ) )
                    {
                        ignore_tag = true;
                    }
                }
            }
        }
    }

    //print results
    printf( "HTML Tags Found:\n" );
    for( int i = 0; i < numOfTags; i++ )
    {
        printf( "%s: %d\n", tags[i], tagCounts[i] );
    }
}

int main( void )
{
    htagsA3();
}

This program has the desired output:

HTML Tags Found:
body: 1
div: 1
p: 2
b: 2
span: 2
Andreas Wenzel
  • Thank you so much, your answer is a lifesaver! I wanted to ask a question regarding the last part. What if I do not want to consider the closing tags as tags, only the opening tags? – goldengomi Mar 22 '23 at 23:26
  • @goldengomi: You have two options for that: (1) You can process the closing tags the same way as the opening tags and then filter them out later, or (2) you can prevent the entire tag from being processed and stored, by checking whether the first character after the `<` is a `/` character, and if that is the case, skip everything until you encounter the `>` character. – Andreas Wenzel Mar 22 '23 at 23:39
  • So I made all the changes, and for the filtering, I tried two options: (1) using an if statement to not print any closing tags, and (2) using two nested for loops to check for any characters that are not alphanumeric, setting a boolean variable to 0 if there are any, and then printing only the tags whose boolean variable is 1 (i.e. all characters are alphanumeric). (continued in next comment) – goldengomi Mar 23 '23 at 18:59
  • When I run my program on a larger html file, the first option gets me all opening tags even the ones with special characters whereas the second option only gets me a few of the tags with no special characters even when the old output clearly had many more. So I wanted to know how I can only consider tags with no special characters to be tags as that is supposed to be the criteria for the question. @Andreas Wenzel – goldengomi Mar 23 '23 at 18:59
  • @goldengomi: I have now added an alternative solution which I believe does what you want. – Andreas Wenzel Mar 26 '23 at 23:26
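
Option (2) from the comments above, skipping a tag when the first character after the `<` is a `/`, could look roughly like the sketch below. It works on a string already in memory rather than on `stdin`, and the function name `count_opening_tags` and its parameters are hypothetical, chosen only for illustration. It also applies the rule from the question that a tag name is alphanumeric and ends at a space or `>`.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: counts opening tags named `name` in `html`.
   A tag name starts right after '<' and ends at ' ' or '>'. */
int count_opening_tags(const char *html, const char *name)
{
    size_t name_len = strlen(name);
    int count = 0;

    for (const char *p = html; *p != '\0'; p++)
    {
        if (*p != '<')
            continue;

        const char *start = p + 1;

        /* Option (2): a '/' right after '<' marks a closing tag, so skip it. */
        if (*start == '/')
            continue;

        /* Measure the alphanumeric run that forms the tag name. */
        size_t len = 0;
        while (isalnum((unsigned char)start[len]))
            len++;

        /* Count only if the name ends at ' ' or '>' and matches `name`. */
        if ((start[len] == ' ' || start[len] == '>') &&
            len == name_len && strncmp(start, name, len) == 0)
        {
            count++;
        }
    }
    return count;
}
```

For one line of the sample file, this counts each of `p`, `b`, and `span` once, because the three closing tags are never examined as names.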