
I am trying to write an elementary XML parser in C, without using any non-standard libraries, which will be able to:

  • detect several different tags
  • detect an empty tag
  • detect tag mismatch

The main problem I have is how to distinguish the beginning of a tag, its content, and the end of the tag.

My idea was to implement a finite-state machine while reading the file in order to know what I am reading.

Please tell me your ideas, and correct me if I am heading in the wrong direction.

EDIT: added a chunk of code that detects the elements and content

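int tmp, buff = -1; /* int, not char: fgetc() returns an int so that EOF can be told apart from real characters */
char *content = malloc(size + 1); /* the cast and sizeof(char) are unnecessary in C; check the result for NULL */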
int stage = -1;
int i = 0;
while((tmp = fgetc(file)) != EOF) {
    if(tmp == '<') {
        if(stage == 2 && buff != '>'){
            printf("content: ");
            printCont(content,i);
        }
        stage = 1;
        buff = tmp;
        i = 0;
        continue;
    }else if(tmp == '/' && buff == '<') {
        stage = 3;
        buff = tmp;
        i = 0;
        continue;
    } else if(tmp == '>') {
        if (stage == 1) {
            printf("tag_start: ");
        } else if (stage == 3) {
            printf("tag_end: ");
        } else if (stage == 2) {
            printf("content: ");
        }
        buff = tmp;
        printCont(content, i); // prints the collected content
        stage = 2;
        i = 0;
        continue;
    }
    if(tmp != ' ' && tmp != '\n' && tmp != '\t') {//simple filter
        content[i] = tmp;
        buff = tmp;
        i++;
    }
}

I would be really grateful if you could comment on the code above and tell me how to improve it. So far it detects the tags and the content, which is what I really needed in the first place.

Mark
  • Check out existing XML parsers for guidance. I recommend you also read up on the `tree` data structure. [libxml2](http://www.xmlsoft.org/) – aglasser May 16 '14 at 18:21
  • "Please tell me your ideas" ok. unless you have a **very** compelling reason to do otherwise, [**consider these options**](http://stackoverflow.com/questions/9387610/what-xml-parser-should-i-use-in-c/9387612#9387612) before reinventing the wheel. – WhozCraig May 16 '14 at 18:21
  • As a computer science student, I was asked to write my own parser as part of a course on programming in C. I have already read many threads on Stack Overflow about this, and all of them point towards using libraries, but as I wrote in the question, I need to do it on my own without any non-standard libraries. I kindly asked for your advice, so please help me if you are willing to; if you do not want to help, then please leave some space for others who are. – Mark May 16 '14 at 18:33
  • It sounds like what you are looking for is a textbook on parsing, such as [the Dragon Book](https://www.powells.com/biblio/65-9780321486813-0). It is not cheap, but your library should have a copy. – zwol May 16 '14 at 18:58
  • @Mark, you are on the right track in asking questions. Unfortunately, asking for a discussion (for ideas, etc.) here at SO is apparently taboo. I maintain C code professionally (I haven't been in school for about 30 years). I come across "elementary XML parser" implementations all the time, probably due to the excess baggage that comes with many XML parsers. For example, some applications implement configuration files in XML, but it is not the focus of the application. For these, it is usually easier to write a small XML parser sufficient to read the configuration file. Not uncommon at all. – Mahonri Moriancumer May 16 '14 at 19:23

1 Answer

An FSM, by itself, is not enough. You will need one to break the text up into tokens as specified by the XML spec, but you'll need to use other techniques to actually recognize valid XML (or reject invalid XML).
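
To make the tokenizing step concrete, here is a minimal sketch of such an FSM-style lexer. It is only an illustration under several assumptions: the names `Token`, `TokKind`, and `next_token` are hypothetical, the input is an in-memory string rather than a `FILE *`, and attributes, comments, entities, and CDATA are not handled at all. It treats `<name/>` as an empty tag.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Token kinds a minimal XML lexer might report (hypothetical names). */
typedef enum { TOK_START, TOK_END, TOK_EMPTY, TOK_TEXT, TOK_EOF, TOK_ERROR } TokKind;

typedef struct {
    TokKind kind;
    char text[64];   /* tag name or text content, truncated if longer */
} Token;

/* Copy at most sizeof(t->text) - 1 bytes of [s, s + n) into the token. */
static void set_text(Token *t, const char *s, size_t n)
{
    if (n >= sizeof t->text) n = sizeof t->text - 1;
    memcpy(t->text, s, n);
    t->text[n] = '\0';
}

/* Read one token starting at *p and advance *p past it. */
TokKind next_token(const char **p, Token *t)
{
    assert(p && *p && t);
    const char *s = *p;

    while (isspace((unsigned char)*s)) s++;      /* skip whitespace between tokens */
    t->text[0] = '\0';

    if (*s == '\0') { *p = s; return t->kind = TOK_EOF; }

    if (*s != '<') {                             /* state: reading text content */
        const char *e = s;
        while (*e && *e != '<') e++;
        *p = e;
        while (e > s && isspace((unsigned char)e[-1])) e--;  /* trim trailing whitespace */
        set_text(t, s, (size_t)(e - s));
        return t->kind = TOK_TEXT;
    }

    s++;                                         /* state: inside a tag, just after '<' */
    TokKind kind = TOK_START;
    if (*s == '/') { kind = TOK_END; s++; }      /* "</" begins an end tag */

    const char *name = s;
    while (*s && *s != '>' && *s != '/' && !isspace((unsigned char)*s)) s++;
    size_t len = (size_t)(s - name);

    while (isspace((unsigned char)*s)) s++;
    if (*s == '/' && kind == TOK_START) { kind = TOK_EMPTY; s++; }  /* "<name/>" */

    /* Anything other than '>' here (for example an attribute) is reported as an error. */
    if (*s != '>' || len == 0) { *p = s; return t->kind = TOK_ERROR; }
    set_text(t, name, len);
    *p = s + 1;
    return t->kind = kind;
}
```

Calling `next_token` repeatedly on `"<a> hi </a><b/>"` would yield a start tag `a`, the text `hi`, an end tag `a`, an empty tag `b`, and then `TOK_EOF`. Note that the lexer only classifies pieces of input; it has no idea whether the end tag matches the start tag, which is why a parser is needed on top of it.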

You'll then need to write a basic recursive descent parser that will take those tokens and use them to recognize valid XML.
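
As a rough sketch of what recursive descent could look like here, the code below has one function per grammar rule. To stay self-contained it reads characters directly instead of consuming a separate token stream, and it assumes a heavily simplified grammar (alphanumeric tag names, no attributes, comments, or entities). All names, including `xml_is_well_formed`, are hypothetical, and counting both `<x/>` and `<x></x>` as empty tags is an assumption about what the assignment means.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NAME_MAX_LEN 32   /* longer names are rejected in this sketch */

typedef struct {
    const char *p;    /* current read position */
    int empty_tags;   /* elements with no content: <x/> or <x></x> */
} Parser;

static void skip_ws(Parser *ps) { while (isspace((unsigned char)*ps->p)) ps->p++; }

/* name := one or more alphanumeric characters */
static bool parse_name(Parser *ps, char *out)
{
    size_t n = 0;
    while (isalnum((unsigned char)*ps->p) && n < NAME_MAX_LEN - 1)
        out[n++] = *ps->p++;
    out[n] = '\0';
    return n > 0;
}

static bool parse_element(Parser *ps);

/* content := (text | element)*   stops in front of "</" or at end of input */
static bool parse_content(Parser *ps, bool *saw_any)
{
    for (;;) {
        skip_ws(ps);
        if (*ps->p == '\0' || (ps->p[0] == '<' && ps->p[1] == '/'))
            return true;
        *saw_any = true;
        if (*ps->p == '<') {
            if (!parse_element(ps)) return false;       /* nested element: recurse */
        } else {
            while (*ps->p && *ps->p != '<') ps->p++;    /* plain text */
        }
    }
}

/* element := '<' name '/>'  |  '<' name '>' content '</' name '>' */
static bool parse_element(Parser *ps)
{
    char open_name[NAME_MAX_LEN], close_name[NAME_MAX_LEN];
    bool saw_any = false;

    if (*ps->p != '<') return false;
    ps->p++;
    if (!parse_name(ps, open_name)) return false;
    skip_ws(ps);
    if (ps->p[0] == '/' && ps->p[1] == '>') {           /* self-closing tag */
        ps->p += 2;
        ps->empty_tags++;
        return true;
    }
    if (*ps->p != '>') return false;
    ps->p++;

    if (!parse_content(ps, &saw_any)) return false;

    if (ps->p[0] != '<' || ps->p[1] != '/') return false;   /* missing end tag */
    ps->p += 2;
    if (!parse_name(ps, close_name)) return false;
    skip_ws(ps);
    if (*ps->p != '>') return false;
    ps->p++;
    if (strcmp(open_name, close_name) != 0) return false;   /* tag mismatch */
    if (!saw_any) ps->empty_tags++;
    return true;
}

/* document := element, followed only by whitespace */
bool xml_is_well_formed(const char *text, int *empty_tags)
{
    assert(text != NULL);
    Parser ps = { text, 0 };
    skip_ws(&ps);
    bool ok = parse_element(&ps);
    skip_ws(&ps);
    if (empty_tags) *empty_tags = ps.empty_tags;
    return ok && *ps.p == '\0';
}
```

With this sketch, `xml_is_well_formed("<a><b>text</b><c/></a>", &n)` would return `true` with `n == 1`, while `"<a><b></a></b>"` is rejected because the end tag `a` does not match the innermost open tag `b`. The recursion is what lets the parser remember which tag is currently open, something a flat state machine cannot do for arbitrary nesting depth.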

This sounds like a basic enough assignment that you don't have to worry about 80% of what's in the XML spec, but make sure you understand start tags and end tags. Even so, this is going to be a non-trivial amount of work.

John Bode
  • +1 indeed this is accurate, and probably the most helpful advice the OP will get for such a broad question. If I had anything constructive at all to add it would be to exploit as much as possible the C++ standard lib when writing this. The algorithms, sequence or otherwise, that it offers are *wonderful*, and as part of the standard, you can reliably use them without losing points for going outside the box. – WhozCraig May 17 '14 at 04:07
  • Thank you for the useful advice! I just added a chunk of code; would you like to take a look at it? – Mark May 17 '14 at 11:26