2

Well, here is my problem: I have an application that uses a custom Javascript implementation, but no support for Regular Expressions.

However, I'd like to be able to parse templates nevertheless; preferably using C++.

A template might look like this (ASP-style template):

<% var foo = someFunction("with a string");
   var bar =  anotherFunction(["with", "an", "array"]); %>

<b>This is html, and this is a variable: <%= bar %></b>

<% if(foo) { %>
    <b> foo is 'true'</b>
<% } else { %>
    <b> foo is 'false'. terrible. </b>
<% } %>

So the general structure is pretty simple (and I'd assume, relatively parseable).

My question is, Is it possible to parse such a template with a while() loop, going through each character, instead of using regular expressions?

And since my attempts to do that failed horribly, how could it be done?

Thank you!

  • 2
    Yes, it's possible. If you want to do it yourself, you need to write a [parser](http://en.wikipedia.org/wiki/Parser). This can quickly get complex, so I strongly recommend using a library to help you (e.g. Boost.Spirit). – Oliver Charlesworth Jun 23 '12 at 15:34
  • 1
    @OliCharlesworth I think this question is about not using side libraries (since no RE library is supported) – madfriend Jul 01 '12 at 21:06
  • As you've now opened a bounty on this question, you should consider specifying what your constraints are. For instance, are you allowed to use *any* libraries? – Oliver Charlesworth Jul 01 '12 at 21:18
  • @OliCharlesworth Correct, I don't have access to any libraries that could remotely be helpful in this matter, however silly that may sound. –  Jul 01 '12 at 21:27
  • Could you be more specific about the need to avoid outside libraries? In particular, are you not allowed to use *any* code you didn't write yourself? – guyrt Jul 01 '12 at 21:30
  • Certainly it's possible. People write compilers and interpreters all the time. This project is in that vein. The question is how "full blown" a parser you need. A complete parser to generate detailed abstact syntax for javascript embedded in HTML is quite a job: not so much hard as lots of code and many tedious details. If you only need something less, like splitting HTML and javascript in some fashion, then the solution could be very much simpler. – Gene Jul 08 '12 at 17:56

4 Answers4

3

Such a template is quite easy to parse.

The key is recognizing that such templates basically consist of a sequence of just two kinds of strings: boilerplate (HTML) text, and script text.

The boiler plate text basically starts with "%>" and ends at "<%" (with special cases at begin-template and end-template). Script text is just everything else. Yes, you can pick off both with just a while loop for each one that watches for "<%", "%>" or "end-of-template". The sequence is implicit in alternating back and forth. That makes for pretty simple parser:

  while not eof
       boilerplate="";
       while next_characters~="<%" or eof
          boilerplate concat next_characters
       end
       scripttext="";
       while next_characters~="%>" or eof
          scripttext concat next_characters
       end
  end

(I leave the details of the individual character management for the coder).

What you didn't say is what you wanted to do with the parsed result. If the goal is to "generate output" from the parsed result, you'll have to turn it into a program. That's actually pretty easy.

Basically you write the result to a file and compile it. For each piece of collected boilerplate text, emit a print statement that prints the boilerplate text; you may have to escape the characters to make them legal in a string literal in your chosen target language, or break the boilerplate into multiple blocks to print it. For each block of script text, simply emit that unchanged. You'll likely have to emit a prolog text chunk to make a function header, and as postlog text chunk to make a function end.

That's it.

[Because of the trivial conversion between such "templates" and a simple program with print statements, I don't find such template programming to be very enticing. It saves me a few print keywords, and that's it.]

Ira Baxter
  • 93,541
  • 22
  • 172
  • 341
  • Your answer essentially repeats what I already addressed "*My question is, Is it possible to parse such a template with a while() loop, going through each character, instead of using regular expressions?*" -- That is, I already know that this is indeed an option, alas, my attempts at doing it that way were fruitless. It cannot be helped, I'm afraid, and so I will have to look into other options. My apologies. –  Jul 08 '12 at 22:35
  • 1
    You asked if it is possible with while loops. It seems to me I answered your question exactly, including a sketch of the algorithm, and in fact went well beyond that to address what you might do with it. I don't know how to help you with "my attempts were frutless". At the risk of being a bit harsh, this is pretty simple code; if you can't get this right, you're not likely to have better luck with "other options". – Ira Baxter Jul 08 '12 at 23:53
  • 1
    ... the usual response to "this isn't working for me" is "what specifically did you try?" – Ira Baxter Jul 08 '12 at 23:57
1

This is what I would attempt in your case:

  1. write a tokenizer that returns a set of known tokens (with the exact character string they represent attached, e.g.: ID("someFunction"))

  2. write a formal grammar using the above tokens that describes the accepted template formats

  3. write a parser that recognizes the grammar (e.g. a push-down automaton, LR parser or LALR parser))

Note: make sure the grammar conforms to the limitations of the parser you are implementing; if it does not, re-write the grammar or change parsers

Note: make sure you test your parser thorougly as errors in the implementation can be hard to debug

Note: during the parsing steps you will need to perform some additonal operations as well if you want to get the semantics (meaning) of the parsed template, not just whether it is a valid template. These extra steps involve storing IDs for variables/functions (remember the string attached to the tokens?), looking them up when referenced, checking function parameter numbers, etc.

Attila
  • 28,265
  • 3
  • 46
  • 55
  • That might seem like a good idea, however, I believe parsing a template is somewhat more difficult to manage with an actual parser, than regular expressions. Furthermore, the result would only have to be like a preprocessor - as in, moving the data that isn't enclosed in `<%` and '%>' into `print`, or `write` function call. I hoped to be able to do that with a simple loop, instead of a fullblown parser. A fullblown parser seems exaggerated, to be honest. –  Jul 03 '12 at 07:59
  • 1
    @nebukadnezzar - Regular expresions are a form of formal grammar, so if you want to implement recognition of regular expressions on your own, you will have to write a parser for it. It might not have to be as general as to be able to parse all regular expressions, but enough to be able to parse the regular expression(s) that you want to use in the processing of the templates. So you will have to write a parser anyway. And if you are writing a parser, you should use the theory of formal grammars/parsers instead of hand-carfting an ad-hoc parser. – Attila Jul 04 '12 at 16:11
1

Have you thought about using a Finite State Machine? Here are some links for your to look at.

In short: FSM consists of finite number of states and transitions between these states. So the process of your parsing might be expressed as follows (pseudocode):

myFSM = new FSM( /* states, transitions */ );
// now your FSM is at initial state.

while not end of file {       
  switch (myFSM->currentState) {   
    case 'IF':
      // Does current line contain closing if? or else? If so, do a transition
      // to state that grabs everything in if construct 
      ...
    case 'TEXT': 
      // Lines do not have any lexical constructs, and we are outside any blocks
      ...
    ...

  }
}

Of course, this is hugely simplified. A real parser would look different. But I hope you got an idea.

Community
  • 1
  • 1
madfriend
  • 2,400
  • 1
  • 20
  • 26
0

There already might be existing solution for you.

Logicless template libraries:

https://github.com/leonidas/transparency/wiki/Frequently-Asked-Questions

(See the last question).

Mikko Ohtamaa
  • 82,057
  • 50
  • 264
  • 435