Parse an HTML-containing file to get the server-side script

Question

I'm currently building a web server that can receive requests and send back responses. I've managed to embed a C# port of Google's V8 JavaScript engine (javascript.net) in my project, and I want to parse a requested file and run the server-side JavaScript code in it. I decided that this code will be contained inside two-character brackets: <: for opening and :> for closing. I started parsing it with code I had written, but after running into problems that made the code messier and probably not very efficient, I decided to try using RegEx (I had to study it because I've never used it before). BUT WAIT. After I talked to my friend about it, he sent me this post: RegEx match open tags except XHTML self-contained tags. From it I understood that this isn't a good idea... So my question is: how do I parse such a thing? (Taking efficiency and clean code into account; after all, it's a web server.) Thanks in advance!

I'm trying to take a file and pull out of it everything written between <: and :>. For example, in PHP you have a file containing HTML code and PHP code; all the PHP code is in <?php ... ?>, and something parsing the file pulls out the PHP code within those brackets. – UnTraDe, Feb 10 '13 at 22:17
They don't send a file, they send a request for a file that already exists on the server. E.g. you type 127.0.0.1:8080/test.html and the server at 127.0.0.1 sends a response containing the contents of test.html, but I want to parse it first and get all the server-side script inside. – UnTraDe, Feb 10 '13 at 22:25
OK, thank you for your comments. Could you refer me to someplace that explains these concepts? I think I know what I'm talking about, but I'm ready to dig deeper to understand what you're talking about. – UnTraDe, Feb 10 '13 at 22:32

score 0 · Answer 1 · answered Feb 10 '13 at 22:31

If I understand correctly, you want to take everything between "<" and ">", including any "<" and ">" nested inside? Well... Since you can use RegEx for this, maybe try to find the first "<", then keep a counter that increases for every subsequent "<" and decreases for every ">". When the counter is at 0 and the next ">" appears, you have reached the end of the server-side script. If you have some embedded HTML and want to get rid of it, try to detect """" or something like that. This solution is slow, but it's the simplest I can imagine.
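The counter idea can be sketched roughly as follows. This is only an illustration in Python (the project itself is in C#), and the function name `extract_nested` is made up for the example. Note that it counts every "<" and ">" blindly, so a comparison such as `a > b` inside the script would end the block too early.

```python
def extract_nested(text, open_ch="<", close_ch=">"):
    """Return the text between the first open_ch and its matching close_ch.

    Hypothetical helper: returns None when there is no opening character
    or when the brackets never balance.
    """
    start = text.find(open_ch)
    if start == -1:
        return None
    depth = 0  # nested open_ch characters seen after the first one
    for i in range(start + 1, len(text)):
        ch = text[i]
        if ch == open_ch:
            depth += 1
        elif ch == close_ch:
            if depth == 0:
                # This close_ch matches the very first open_ch.
                return text[start + 1:i]
            depth -= 1
    return None  # ran off the end without balancing
```

The scan is a single pass over the input, so its cost grows linearly with the file size.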

Mateusz Krzaczek

Thank you for your answer, but from what I read in the post I mentioned in the question, I understood that it is not a good idea to use regex for such a task. Your idea sounds like a solution, but you said it is slow, and I really need performance optimization here. If you have a better idea that will work faster, please explain it to me as you did in your answer. – UnTraDe Feb 10 '13 at 22:36

icktoofay · Accepted Answer · 2013-02-10T22:55:13.820

Ideally, what you'd want to do is hook into V8's lexer so you don't end up catching things inside of strings and such. I looked at the source to that .NET wrapper, however, and it looks like it doesn't allow that much customization. Instead, you may want to create a small state machine. You'd probably want at least these states:

Literal data (for stuff outside of your <: and :> tags)
Left angle bracket (for once you've consumed a < and are waiting for a potential :)
Script state (for stuff inside of your <: and :> tags)
Script double-quote string state
Script double-quote string escape state
Script single-quote string state
Script single-quote string escape state
Script slash state (for comments and regular expressions¹)
Script line comment state
Script block comment state
Script block comment star state
Script regular expression state
Script colon state (for when you've encountered a : and are unsure whether a > or something else is next)

It may not be so quick to write as a regular expression, but it would be able to handle code like this:

Hello, world!
<:
    document.write("At least you won't think the script :> ends there.");
:>

¹On second thought, it's probably not so easy to detect regular expressions.
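A minimal sketch of such a state machine, written in Python for brevity rather than C#. It makes several simplifying assumptions that are not part of the answer: the function name `split_template` and the `SAMPLE` constant are invented for the example, the "left angle bracket", "colon", "slash", "star" and "escape" states are folded into a two-character lookahead, and regular expression literals are not recognized at all (as the footnote says, those are hard to detect). A `:>` inside a comment is also treated as part of the comment.

```python
SAMPLE = (
    "Hello, world!\n"
    "<:\n"
    "    document.write(\"At least you won't think the script :> ends there.\");\n"
    ":>\n"
)


def split_template(src):
    """Split src into ("literal", text) and ("script", text) chunks."""
    chunks, buf = [], []
    state = "literal"
    i = 0

    def flush(kind):
        # Emit the buffered text as one chunk; empty chunks are skipped.
        if buf:
            chunks.append((kind, "".join(buf)))
            buf.clear()

    while i < len(src):
        ch, two = src[i], src[i:i + 2]
        take = ch  # text appended to the current chunk this iteration
        if state == "literal":
            if two == "<:":
                flush("literal")
                state, take = "script", ""
                i += 1  # also skip the second delimiter character
        elif state == "script":
            if two == ":>":
                flush("script")
                state, take = "literal", ""
                i += 1
            elif ch == '"':
                state = "dq"
            elif ch == "'":
                state = "sq"
            elif two == "//":
                state, take = "line_comment", two
                i += 1
            elif two == "/*":
                state, take = "block_comment", two
                i += 1
        elif state in ("dq", "sq"):
            if ch == "\\":
                take = two  # keep the escaped character with its backslash
                i += 1
            elif ch == ('"' if state == "dq" else "'"):
                state = "script"
        elif state == "line_comment":
            if ch == "\n":
                state = "script"
        elif state == "block_comment":
            if two == "*/":
                state, take = "script", two
                i += 1
        if take:
            buf.append(take)
        i += 1
    # An unterminated block is emitted as script here; a real server
    # would probably want to report an error instead.
    flush("literal" if state == "literal" else "script")
    return chunks
```

Calling `split_template(SAMPLE)` on the example above yields a literal chunk, one script chunk that still contains the `:>` inside the string, and a trailing literal chunk. The script chunks could then be handed to the JavaScript engine while the literal chunks are written to the response unchanged.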

edited Feb 10 '13 at 22:55

answered Feb 10 '13 at 22:38

icktoofay

I understood that RegEx is no solution for this task (because of security issues?), and I want to know how they do it in web servers that run PHP/JSP/.NET etc... Is there a common method used in such cases? – UnTraDe Feb 10 '13 at 22:43
@UnTraDe: I don't know how JSP and .NET do it, but PHP has it integrated into its parser. As I was saying at the start of my answer, the ideal way would be to hook in to V8 and do the same thing, but your .NET wrapper doesn't allow it. – icktoofay Feb 10 '13 at 22:44
I have the option here to switch to C++ and use the original V8 engine (I've used V8 in C++ before), but since this is a project I'm doing just for fun, I think it's faster to write and produce a functioning web server in C#, especially with that .NET wrapper, which makes it easier to embed JS code in C# code. But if you think it will be easier to parse it using C++ (maybe V8 has something like you mentioned in the comment above, or there is a C++ library), I think it will be worth the trouble to switch. – UnTraDe Feb 10 '13 at 22:50
@UnTraDe: I really do not know if V8 offers the feature at all, but I know that if it did, it would be a better solution. – icktoofay Feb 10 '13 at 22:53
So I was just thinking a bit more about this and if it can be done with a finite state machine, it can probably be done with a regular expression, too (albeit not a very simple one). – icktoofay Feb 10 '13 at 23:17
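For illustration, one possible regular expression along those lines is sketched below in Python. It is an untested-in-production assumption, not a pattern from the thread: the names `SCRIPT_BLOCK` and `find_scripts` are invented, regular expression literals in the script are still not handled, and a `:>` inside a comment is treated as part of the comment. Each alternative starts with a different character, which keeps backtracking limited.

```python
import re

# Verbose pattern: whitespace and "#" comments inside it are ignored.
SCRIPT_BLOCK = re.compile(r"""
    <:                              # opening delimiter
    (?P<code>
        (?:
            "(?:\\.|[^"\\])*"       # double-quoted string with escapes
          | '(?:\\.|[^'\\])*'       # single-quoted string with escapes
          | //[^\n]*                # line comment, up to end of line
          | /\*.*?\*/               # block comment
          | /(?![/*])               # a lone slash (e.g. division)
          | :(?!>)                  # a colon that does not close the block
          | [^:"'/]                 # any other single character
        )*
    )
    :>                              # closing delimiter
""", re.VERBOSE | re.DOTALL)


def find_scripts(text):
    """Return the code of every <: ... :> block found in text."""
    return [m.group("code") for m in SCRIPT_BLOCK.finditer(text)]
```

Because quotes are excluded from the catch-all character class, a block with an unterminated string does not match at all instead of ending at a `:>` inside that string.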
