8

I have an HTML file that may contain JavaScript, PHP, and whatever else people may or may not put into their HTML files.

I want to extract all comments from this HTML file.

I can point out two problems in doing this:

  1. What is a comment in one language may not be a comment in another.

  2. In JavaScript, the remainder of a line is commented out with the // marker. But URLs also contain //, so if I simply replace // and the rest of the line with nothing, I may well cut off parts of URLs.

So this is not a trivial problem.

Is there a solution for this already available anywhere?

Has anybody already done this?

brian d foy
john-jones
  • You are right that this is not trivial. In order to reliably remove comments, you need to fully parse the file (PHP, HTML, and JavaScript). I suggest working in PHP if possible; while I like Perl better, PHP's tools to work on itself are better than Perl tools to work on PHP. Here is something to get you started: http://stackoverflow.com/questions/503871/best-way-to-automatically-remove-comments-from-php-code. Then you just need to find HTML and JavaScript parsers in PHP to do likewise for those portions of the file. – dan1111 Oct 19 '12 at 10:41
  • Why would you have PHP in your HTML file? If you just have CSS, JavaScript and HTML, then google "HTML Minifier" for products which can remove comments, whitespace, and generally "slim down" your pages. – RB. Oct 19 '12 at 10:43
  • @RB, the HTML to parse may, at some point, not even be mine. – john-jones Oct 19 '12 at 10:56
  • Your point #2 is precisely why I always use /// in my comments -- just a random point, but I have come across this problem before and it changed my commenting habits forever ;) What are your reasons for needing this ability? And by "extract", do you mean to keep comments or discard them? – Pebbl Oct 19 '12 at 11:02
  • Well, I intend to discard them, but a solution that does not tie me to discarding them would be more modular. – john-jones Oct 19 '12 at 11:07
  • What do you mean by extracting? Do you want to use those comments or do you want to remove those comments? – Chankey Pathak Oct 19 '12 at 17:05
  • Well, extracting them can take the form of, for instance, getting their location within the file. So I would receive an index list indicating where comments begin and end. As I've already said, my intention this time around is to discard them. – john-jones Oct 19 '12 at 17:56
  • I can't claim credit for this gist, but something like this could get you moving in the right direction: https://gist.github.com/3837258 – oalders Oct 19 '12 at 19:41

4 Answers

2

Problem 2: Isn't every URL quoted, as in "www.url.com" or 'www.url.com', when you write it in either language? I'm not sure. If that's the case, then all you have to do is parse the code and check whether a quote mark precedes the slashes, to know if it's a real URL or just a comment.
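A minimal sketch of that quote check, in Python. It assumes the input is plain JavaScript and that only single- and double-quoted strings matter; the function name `strip_line_comments` is hypothetical. It deliberately ignores block comments, regex literals and template literals, so a `//` inside something like `/\//` would still be misread.

```python
def strip_line_comments(js: str) -> str:
    """Remove // comments that occur outside quoted strings (simplified)."""
    out = []
    quote = None  # quote character of the string we are inside, if any
    i = 0
    while i < len(js):
        ch = js[i]
        if quote is not None:
            out.append(ch)
            if ch == "\\" and i + 1 < len(js):
                # Copy the escaped character so \" does not end the string.
                out.append(js[i + 1])
                i += 1
            elif ch == quote or ch == "\n":
                # Closing quote, or an unterminated string ending at the newline.
                quote = None
            i += 1
        elif ch in "\"'":
            quote = ch
            out.append(ch)
            i += 1
        elif js.startswith("//", i):
            # Skip to the end of the line, keeping the newline itself.
            end = js.find("\n", i)
            i = len(js) if end == -1 else end
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

Because the scanner only treats `//` as a comment marker while it is outside a string, a quoted URL such as `"http://example.com"` passes through unchanged.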

1

Look into parser generators like ANTLR which has grammars for many languages and write a nesting parser to reliably find comments. Regular expressions aren't going to help you if accuracy is important. Even then, it won't be 100% accurate.

Consider

Problem 3, text that looks like a comment in a language is not always a comment in that language.

<textarea><!-- not a comment --></textarea>
<script>var re = /[/*]not a comment[*/]/, str = "//not a comment";</script>

Problem 4, a comment embedded in a language may not obviously be a comment.

<button onclick="&#47;&#47; this is a comment//&#10;notAComment()">
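To make Problem 4 concrete, here is a small Python sketch that decodes the attribute value above the way an HTML parser would before handing it to the script engine. The variable names are only illustrative.

```python
import html

# The raw attribute value exactly as it appears in the markup.
attr = "&#47;&#47; this is a comment//&#10;notAComment()"

# Character references are decoded before the value is treated as script.
decoded = html.unescape(attr)

# &#47; decodes to "/" and &#10; to a newline, so the first line is a
# JavaScript line comment and the second line is live code.
lines = decoded.split("\n")
```

A tool that searches the raw markup for `//` would find the trailing `//` but miss the encoded pair that actually starts the comment.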

Problem 5, what is a comment may depend on how the browser is configured.

<noscript><!-- </noscript> Whether this is a comment depends on whether JS is turned on -->
<!--[if IE 8]>This is a comment, except on IE 8<![endif]-->

I had to solve this problem partially for contextual templating systems that elide comments from source code to prevent leaking software implementation details.

https://github.com/mikesamuel/html-contextual-autoescaper-java/blob/master/src/tests/com/google/autoesc/HTMLEscapingWriterTest.java#L1146 shows a testcase where a comment is identified in JavaScript, and later testcases show comments identified in CSS and HTML. You may be able to adapt that code to find comments. It will not handle comments in PHP code sections.

Mike Samuel
0

From your wording, it seems you are considering an approach based on regular expressions. Doing that on the whole file is a pain. Instead, use tools to isolate or discard the interesting or uninteresting text first, then work on what is left in your sieve according to the keep/discard criteria. Have a look at HTML::Tree and HTML::TreeBuilder; they could be very useful for dealing with the HTML markup.
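As a rough illustration of letting a parser do the sieving, here is a sketch that uses Python's standard `html.parser` module as a stand-in for the Perl modules named above (it is not HTML::Tree). The names `CommentCollector` and `find_html_comments` are hypothetical. It reports HTML comments only and knows nothing about PHP or about comments inside scripts.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collect HTML comments together with the position where each was seen."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # getpos() gives the parser's current (line, column) position.
        line, col = self.getpos()
        self.comments.append((line, col, data))


def find_html_comments(html_text):
    parser = CommentCollector()
    parser.feed(html_text)
    parser.close()
    return parser.comments
```

Each entry carries the comment text and a position, so the caller can decide later whether to keep, report or discard it.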

Daniel
0

I would convert the HTML file into a character array and parse it. You can detect key strings like "<", "--", "www", "http" as you move forward, and either skip or delete those segments.

The start/end indices will have to be identified properly, which is a challenge, but you will have full control.
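A hedged sketch of such a forward scan in Python, returning start/end offsets rather than deleting anything. The function name `html_comment_spans` and the list of raw-text elements are assumptions made for illustration. The tag matching is naive (it breaks on a `>` inside an attribute value) and PHP sections are not handled.

```python
import re

# Assumption: bodies of these elements are skipped, since comment-like
# text inside them is not an HTML comment.
RAW_OPEN = re.compile(r"<(script|style|textarea|title)\b[^>]*>", re.IGNORECASE)


def html_comment_spans(html: str) -> list[tuple[int, int]]:
    """Return (start, end) offsets of <!-- ... --> outside raw-text elements."""
    spans = []
    i = 0
    while i < len(html):
        if html.startswith("<!--", i):
            close = html.find("-->", i + 4)
            end = len(html) if close == -1 else close + 3
            spans.append((i, end))
            i = end
            continue
        m = RAW_OPEN.match(html, i)
        if m:
            # Jump to the matching close tag without looking inside the body.
            closer = re.compile("</" + m.group(1), re.IGNORECASE).search(html, m.end())
            i = len(html) if closer is None else closer.start()
            continue
        i += 1
    return spans
```

Since the result is a list of offsets, the same scan can serve both extraction (slice the spans out) and removal (keep everything between them).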

There are also other ways to simplify the process if performance is not a problem. For example, all tags can be grabbed with XML::Twig and the string can be parsed to detect JS comments.