I have an input file with the following text.

<html>
<head><title>My web page</title></head>
<body>
<p>Foo bar<br />
Hi there!<br />
How is it going?
</p>
<p>
I'm fine.  And you?
</p>
<p>
Here is a <a href="somelink.html">link</a> to follow.
</p>
</body>
</html>

I am tasked with removing the HTML tags, outputting one \n for each <br /> and two \n for each <p>. My code works fine, except that it counts </p> as a <p>, and I do not want to output a \n for </p>. I have been racking my brain for the last hour trying to think of a way to account for this, and I cannot. Might someone offer a suggestion?

void main(){

  FILE *ifp, *ofp;//input/output file pointers
  int c, c1, c2;//variables used to store and compare input characters, c2 is used only to check for a </p> tag
  int n = 0;
  int count = 0;//counter for total characters in file
  int putCount = 0;//counter for number of outputted characters
  int discardTag = 0; //counter to store number of discarded tags
  float charDiff = 0;//variable to store file size difference
  int br = 0; //counter for <br />
  int p = 0;//counter for <p>
  ifp = fopen("prog1in1.txt", "r");
  ofp = fopen("prog1in1out.txt", "w");

  do{
    c = getc(ifp);
    count ++;
    //compares the current character to '<' if its found starts a while loop
    if(c == '<'){
      //loops until it reaches the end of the tag
      while( c != '>'){
        count ++;
        c = getc(ifp);

        /*compares the first two characters to determine if it is a <br /> tag 
          if true outputs a null line and counts the number of <br /> tags*/
        if(c == 'b' ){
          c = getc(ifp);
          count ++;
          if( /*c == 'b' &&*/ c == 'r'){
            br ++;
            c = '\n';
            putc( c , ofp);
            count += 1;
          }

        }//end br if


        /*else if if the tag is <p> outputs two null lines 
          and counts the number of <p> tags*/
        else if ( c == 'p' ){
          p ++;
          c = '\n';
          putc( c ,ofp);
          putc( c, ofp);
          count +=2;

        }//end p if

        //counts the number of tags that are not <br />             
        else{ //if ( c2 != 'b' && c1 != 'r' || c1 != 'p'){
          discardTag ++;
        }// end discard tag
      }//end while

    }

    /*checks if the current character is not '>' 
      if true outputs the current character*/
    if( c != '>'){
      putc( c , ofp);
      putCount++;
    }
    else if( c == EOF){

      //does nothing here yet 
    }

  }while(c != EOF);
  fclose(ifp);

}//end main
keshlam
Azethoth
  • **Please** (a) format your code properly and (b) do not use `void main` – Paul R Feb 09 '14 at 21:56
  • I tried to format it properly for posting, and my instructor prefers void main so that's why I used it. If you are referring to the formatting of the code in visual studio, I always clean it up when I am done. – Azethoth Feb 09 '14 at 22:05
  • Does the code above look properly formatted to you ? Also it sounds like your instructor needs to read a good book on C if he prefers `void main`. – Paul R Feb 09 '14 at 22:15
  • Other than excessive white space, yes everything seems to be properly indented and nested and commented. Perhaps you would be willing to explain where the formatting is incorrect. I suppose you could be referring to the html code, that is simply copy pasta from the text file. – Azethoth Feb 09 '14 at 22:25
  • Your code checks one character at a time inside `<..>`, that's why you catch `</p>` as well. It will also catch the "p" and the "br" in other tag names. You might want to read the entire "word" right after `<` to prevent that (this also solves your current problem).
    – Jongware Feb 09 '14 at 23:01
  • I was thinking along those lines but was trying to keep the code as light and simple as possible, and had not considered other tags containing similar combinations other than `</p>`. That may be the best solution. Thank you. My apologies, I am still not familiar with all the functions of this site. – Azethoth Feb 09 '14 at 23:08
  • Indentation fixed. (We can quibble about microstyle issues, but at least this follows one of the standard conventions.) – keshlam Feb 09 '14 at 23:43
  • Note that Microsoft C explicitly supports `void main()` as an implementation-defined alternative. See [What should `main()` return in C and C++?](http://stackoverflow.com/questions/204476/what-should-main-return-in-c-and-c/18721336#18721336) – Jonathan Leffler Feb 09 '14 at 23:46

1 Answer

Suggestions:

  1. Use a buffer to read HTML tags in full. The buffer can be reasonably small -- standard HTML tag names are short (blockquote and figcaption, for example, are 10 characters), and even if a longer one turns up, you are only interested in <p> and <br> anyway.

  2. Avoid long if..else if..else.. constructions when possible. HTML can be expressed in a fairly concise finite state machine. For your purpose, you need 2 parts: reading and writing plain text, and parsing any HTML command. The plain text can be processed per character; for the HTML command, you need at least its full tag name.

  3. What should be done with existing spaces and hard returns? Unfortunately, the HTML rules are not 100% clear on that point. Tabs and hard returns are considered a single space; multiple spaces are concatenated into a single one. However, the rules for space in between opening and closing HTML tags -- your </p>\n<p> sequences, for example -- are less strictly defined. This issue manifests itself as 'unexpected spaces' at the start and end of text strings.

Using the above, here is a total rewrite. By way of demonstration, it suppresses multiple spaces where possible and contains a special treatment of <PRE> blocks. It does not check for edge cases, such as non-matching tags, a self-closing tag written without a space (<br/> is read as BR/), whitespace directly after < (which leaves the tag name empty), and stray < characters inside your text (which are usually silently fixed by modern browsers).

#include <stdio.h>
#include <string.h>
#include <ctype.h>

int main (void)
{
    FILE *ifp, *ofp;//input/output file pointers
    int c;//current input character
    int count = 0;//counter for total characters in file
    int putCount = 0;//counter for number of outputted characters
    int discardTag = 0; //counter to store number of discarded tags
    float charDiff = 0;//variable to store file size difference
    int br = 0; //counter for <br />
    int p = 0;//counter for <p>
    ifp = fopen("prog1in1.txt", "r");
    ofp = fopen("prog1in1out.txt", "w");
    if (ifp == NULL || ofp == NULL)
    {
        perror("fopen");
        return 1;
    }

    char html_tag_buf[32];
    int len_html_tag;

    int inside_pre_block = 0;
    int wrote_a_space = 0;

    do
    {
        c = getc(ifp);
        count ++;
        //decides what to do based on the current character
        switch (c)
        {
            case EOF:
                break;

            // both newline and tab are considered a single space in HTML
            // HTML does not support multiple spaces, except in PRE../PRE
            case '\n': case '\t': case ' ':
                if (inside_pre_block)
                {
                    putc(c , ofp);
                    putCount++;
                    break;
                }
                if (!wrote_a_space)
                {
                    wrote_a_space = 1;
                    putc( ' ' , ofp);
                    putCount++;
                }
                break;

            case '<':
                wrote_a_space = 0;

                //reads the tag name, which ends at whitespace or '>'
                len_html_tag = 0;

                while( c != '>' && c != ' ' && c != '\n' && c != '\t')
                {
                    c = getc(ifp);
                    count++;
                    if (c == EOF)
                        break;

                    if (c != '>' && c != ' ' && c != '\n' && c != '\t')
                    {
                        html_tag_buf[len_html_tag] = toupper(c);
                        len_html_tag++;
                        if (len_html_tag > 30)
                            break;
                    }
                }
                //skips any attributes up to the closing '>'; stops at EOF as well,
                //so an unterminated tag cannot loop forever
                while (c != '>' && c != EOF)
                {
                    c = getc(ifp);
                    count++;
                }
                html_tag_buf[len_html_tag] = 0;

                if (!strcmp (html_tag_buf, "P"))
                {
                    wrote_a_space = 1;
                    putc('\n' , ofp);
                    putc('\n' , ofp);
                } else
                if (!strcmp (html_tag_buf, "BR"))
                {
                    wrote_a_space = 1;
                    putc('\n' , ofp);
                } else
                {
                    if (!strcmp (html_tag_buf, "PRE"))
                        inside_pre_block = 1;
                    if (!strcmp (html_tag_buf, "/PRE"))
                        inside_pre_block = 0;

                    //counts the number of tags that are neither <p> nor <br />
                    discardTag ++;
                }

                break;

            default:
                wrote_a_space = 0;
                putc( c , ofp);
                putCount++;
        }
    } while(c != EOF);
    fclose(ifp);
    fclose(ofp);
    return 0;
} //end main

Output for your test file (note the extraneous spaces):

·My·web·page··

Foo·bar
Hi·there!
How·is·it·going?··

I'm·fine.·And·you?··

Here·is·a·link·to·follow.···

(center dots · indicate spaces).

Jongware