Detect if source is CSS/HTML/JavaScript

Question

I want to use js beautify on some source but there isn't a way to detect what type of source it is. Is there any way, crude or not, to detect if the source is css, html, javascript or none?

Their site has this function, which looks like it figures out whether the source is HTML:

function looks_like_html(source) {
    // <foo> - looks like html
    // <!--\nalert('foo!');\n--> - doesn't look like html
    var trimmed = source.replace(/^[ \t\n\r]+/, '');
    var comment_mark = '<' + '!-' + '-';
    return (trimmed && (trimmed.substring(0, 1) === '<' && trimmed.substring(0, 4) !== comment_mark));
}

I just need to see whether it's CSS, JavaScript, or neither. This is running in Node.js.

So this code would need to tell me it's JavaScript:

var foo = {
    bar : 'baz'
};

whereas this code needs to tell me it's CSS:

.foo {
    background : red;
}

So a function to test this would return the type:

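function getSourceType(source) {
    // isJs, isHtml and isCss are the checks I am looking for
    if (isJs(source)) {
        return 'js';
    }
    if (isHtml(source)) {
        return 'html';
    }
    if (isCss(source)) {
        return 'css';
    }
    return null; // none of the above, e.g. Java
}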

There will be cases where other languages such as Java are used, and I need to ignore those. For CSS/HTML/JS I can run the beautifier.

What is the output you expect. For instance, if I pass a string to the function that determines what it is, what do you expect as a return? — Bram Vanroy, Jun 10 '15 at 18:17
Is this for a page that has JavaScript and CSS embedded? You could check for ` — Spencer Wieczorek, Jun 10 '15 at 18:22
This could be a multitude of it depending on implementation. If it's a single function to determine the type then it could return a string (`'css'`, `'html'`, `'js'`, `null`) or if there are separate functions then a bool for `isCss` function. — Mitchell Simoens, Jun 10 '15 at 18:23
I don't have any time anymore today, but for the ones who are interested in solving this question, [here's my start](http://jsfiddle.net/BramVanroy/k7xnayu7/). Not sure how you can get a quick regex check for JS though. Good luck! — Bram Vanroy, Jun 10 '15 at 18:30
Thanks @BramVanroy that's a good rough start. However a JS test I added is being picked up as CSS but it's a start. — Mitchell Simoens, Jun 10 '15 at 18:45
It's easy to check the CSS and HTML, not as easy for JS. You could do it in the form of: `if(foo){ 'HTML'} else if(bar) { 'CSS' } else 'JS'`. Would that design work for you? — Spencer Wieczorek, Jun 10 '15 at 18:46
@SpencerWieczorek unfortunately other languages like Java are likely to be used also. — Mitchell Simoens, Jun 10 '15 at 18:47
@MitchellSimoens Alright that's what I was thinking, it's not going to be an easy solution, it might not even be possible (for other programming languages that is)... Since the syntax could be too similar. — Spencer Wieczorek, Jun 10 '15 at 18:50
@SpencerWieczorek I agree on maybe not being possible and if it is may be unstable. But thought I'd throw it out to see if anyone has any idea. — Mitchell Simoens, Jun 10 '15 at 18:52
You can take a look at the source for [highlight.js](https://highlightjs.org/download/). It allows for a custom package, in other words, only allow for CSS, JS, HTML and Java. *And* it supports node.js integration. — Bram Vanroy, Jun 10 '15 at 19:57
'#' I would have thought would be only on css when not inside quotes mark. 'var' with a space not inside tags should be good for JS and only check those if not html. — GillesC, Jun 10 '15 at 21:12
FWIW, Google Code Prettify does something similar to what you're looking to achieve, but it doesn't try *too* hard. — BoltClock, Jun 16 '15 at 16:54
Function(source) throws on invalid JS code. You should find CSS parser for Node.js and check if CSS is parsed without errors. If these tests do not pass, it's HTML. — Ginden, Aug 24 '15 at 02:38

hogan · Answer 1 · 2015-06-12T17:07:44.263

Short answer: Almost impossible.

- Thanks to Katana's input

The reason: valid HTML can contain JS and CSS (and it usually does). JS can contain both CSS and HTML (e.g. var myContent = '< div >< style >CSS-Rules< script >JS Commands';). Even CSS can contain both in comments.

So writing a parser for this is close to impossible. You cannot separate them easily.

Each language has rules for how it is written. What you want is to reverse engineer those rules and check whether they apply. That is probably not worth the effort.

Approach 1

If the requirement is worth the effort, you could run different parsers on the source and see which ones throw errors. For example, Java source is unlikely to be valid HTML/JS/CSS, but it is valid Java (if written properly).
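As a minimal sketch of the parser idea for the JS case, the Function constructor can be used to compile the source without running it, as Ginden's comment on the question suggests. The helper name parsesAsJs is hypothetical, and a comparable check for CSS would need a CSS parser, which is not shown here.

```javascript
// Hypothetical helper: true if `source` compiles as a JavaScript function body.
// new Function() only compiles the text; it never calls the resulting function.
function parsesAsJs(source) {
    try {
        new Function(source);
        return true;
    } catch (err) {
        // A SyntaxError means the text is not valid JS; rethrow anything else.
        if (err && err.name === 'SyntaxError') {
            return false;
        }
        throw err;
    }
}

console.log(parsesAsJs("var foo = {\n    bar : 'baz'\n};")); // true
console.log(parsesAsJs(".foo {\n    background : red;\n}")); // false
```

An empty string compiles too, so a successful parse is evidence rather than proof that the source is JS.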

Approach 2 - Thanks to Bram's input

However, if you know the source very well and can assume that these mixed cases don't occur in your code, you could try the following with regular expressions.

Example

<code><div>This div is HTML var i=32;</div></code> 
<code>#thisiscss { margin: 0; padding: 0; }</code>
<code>.thisismorecss { border: 1px solid; background-color: #0044FF;}</code>
<code>function jsfunc(){ { var i = 1; i+=1;<br>}</code>

Parsing

$("code").each(function() {
   var code = $(this).text();
   if (code.match(/<(br|basefont|hr|input|source|frame|param|area|meta|!--|col|link|option|base|img|wbr|!DOCTYPE).*?>|<(a|abbr|acronym|address|applet|article|aside|audio|b|bdi|bdo|big|blockquote|body|button|canvas|caption|center|cite|code|colgroup|command|datalist|dd|del|details|dfn|dialog|dir|div|dl|dt|em|embed|fieldset|figcaption|figure|font|footer|form|frameset|head|header|hgroup|h1|h2|h3|h4|h5|h6|html|i|iframe|ins|kbd|keygen|label|legend|li|map|mark|menu|meter|nav|noframes|noscript|object|ol|optgroup|output|p|pre|progress|q|rp|rt|ruby|s|samp|script|section|select|small|span|strike|strong|style|sub|summary|sup|table|tbody|td|textarea|tfoot|th|thead|time|title|tr|track|tt|u|ul|var|video).*?<\/\2/)) {
      $(this).after("<span>This is HTML</span>");
   }
   else if (code.match(/(([ \t\r\n]*)([a-zA-Z-]*)([.#]{1,1})([a-zA-Z-]*)([ \t\r\n]*)+)([{]{1,1})((([ \t\r\n]*)([a-zA-Z-]*)([:]{1,1})((([ \t\r\n]*)([a-zA-Z-0-9#]*))+)[;]{1})*)([ \t\r\n]*)([}]{1,1})([ \t\r\n]*)/)) {
      $(this).after("<span>This is CSS</span>");
   }
   else {
      $(this).after("<span>This is JS</span>");
   }
});

What it does: it reads the text of each code element and classifies it.

HTML

If it contains '<' followed by br (or any of the other tag names above) and then '>', it is HTML. (The tag-name check matters, because '<' and '>' also appear in JS when comparing numbers.)

CSS

If it matches the pattern: an optional name, followed by . or #, followed by an id or class name, followed by {, and so on. The pattern above also allows for spaces and tabs.

JS

Otherwise it is JS.

You could also do Regex like: If it contains '= {' or 'function...' or ' then JS. Refine this with further regular expressions, and with whitelists and blacklists (for example 'var' with no < or > around it, or 'function(asdsd,asdsad){assads}').

Bram's start, which I continued from, was:

$("code").each(function() {
   var code = $(this).text();
   if (code.match(/^<[^>]+>/)) {
       $(this).after("<span>This is HTML</span>");
   }
   else if (code.match(/^(#|\.)?[^{]+{/)) {
     $(this).after("<span>This is CSS</span>");
   }
});
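Since the question runs in Node.js, where jQuery and a DOM are not available, the same idea can be sketched as a plain function. This is only a crude sketch: the name getSourceType comes from the question, the HTML check is Bram's anchored regex, and the CSS and JS regexes are assumptions made for illustration, not a parser.

```javascript
// Sketch only: crude checks anchored at the start of the source.
// The regexes are illustrative assumptions and will misjudge mixed sources.
function getSourceType(source) {
    var trimmed = source.replace(/^\s+/, '');

    // HTML: the source starts with a tag such as <div> or <!DOCTYPE html>.
    if (/^<[^>]+>/.test(trimmed)) {
        return 'html';
    }
    // CSS: selector text without ( ) ; =, then a block whose first entry
    // looks like "property: value;".
    if (/^[^{}();=]+\{\s*[a-zA-Z-]+\s*:[^;{}]+;/.test(trimmed)) {
        return 'css';
    }
    // JS: the source starts with a common declaration keyword.
    if (/^(var|let|const|function)\b/.test(trimmed)) {
        return 'js';
    }
    return null; // unknown, e.g. Java or plain text
}

console.log(getSourceType("var foo = {\n    bar : 'baz'\n};")); // js
console.log(getSourceType(".foo {\n    background : red;\n}")); // css
```

Excluding = and ( from the selector part is what keeps the question's var foo = { ... } sample from being picked up as CSS. Anything that matches none of the checks returns null, so other languages can be skipped.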

For more information:

http://regexone.com is a good reference. Also check http://www.sitepoint.com/jquery-basic-regex-selector-examples/ for inspiration.

This should be a comment - or at least be improved. I think the OP had figured out what you just wrote down. — Bram Vanroy, Jun 10 '15 at 18:35
Spencer is right, this is why I wrote about regular expressions. This is not the solution but a starting point. — hogan, Jun 10 '15 at 18:48
`>` is in HTML, CSS and JS. `<` is in HTML and JS. `=` is in HTML and JS. `{` is in JS and CSS. So your starting point is, bluntly, very poor — Dendromaniac, Jun 10 '15 at 19:55
As I wrote, this is very brief and gives an example. Please review and update your comments. — hogan, Jun 10 '15 at 20:00
Please give reason when downvote or I can't get this answer better. — hogan, Jun 10 '15 at 20:24
@Hogan Plain and simple, for all kinds of JS/CSS/HTML, it's not going to work. Example: `var myTemplateHTML = "
html
";` is JS, not HTML. Even a CSS comment that contains HTML (possible in some CSS commenting systems) would break it. — Katana314, Jun 10 '15 at 20:40
I would've appreciated credit given to me for the start without you blatantly taking over my code. You even remove the most important part of the regex to check HTML. `^`. — Bram Vanroy, Jun 12 '15 at 12:13
@everybody: The question was 'Is there any way, crude or not, to detect if the source is css, html, javascript or none?' The question was not about having the solution ready. The answer to this question was satisfied with the first two answers here already. However as you like to have something complete and 'not blunt' I updated the answer regardless whether this will be upvoted again or not and regardless of what the original Person might accept as an answer. A little less stiffness would be appreciated, I only try to answer the question here. — hogan, Jun 12 '15 at 17:17
@Bram: I updated the answer, should be what you wanted. I removed your html-regex because I don't understand it well enough and replaced it with something I can stand behind of. If you want to take the time, you could explain the regex and tell, where it is better than my solution / where there are weaknesses. — hogan, Jun 12 '15 at 17:20

score 0 · Answer 2 · answered Jun 10 '15 at 18:33

It depends on whether the languages may be mixed, as mentioned in the comments (i.e. JS and CSS embedded in your HTML), or whether they are separate files that you need to tell apart for some reason.

A rigorous approach would be to build a tree from the file, where each node would be a statement (in Perl, you can use HTML::TreeBuilder). Then you could parse it and compare with the original source. Then proceed by applying eliminating regexes to weed out chunks of code and split languages.

Another way would be to search for language-specific patterns (I was thinking that CSS only uses " *= " in some situations, so a " = " by itself must be JavaScript, embedded or not). For HTML you can certainly detect the tags with a regex like

    if($source =~ m/(<.+>)/){}

You would then need to take some special cases into account, such as JavaScript that is used to display HTML code:

    var code = "<body>";
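To make that pitfall concrete, here is a small JavaScript sketch (the Perl pattern above translated, with hypothetical variable names). It compares a tag regex that matches anywhere with one that is anchored at the start of the source.

```javascript
var jsWithMarkup = 'var code = "<body>";';

// Unanchored: tag-like text anywhere counts, so this JS is flagged as HTML.
var unanchored = /<.+>/;
// Anchored: the source must begin with a tag (after optional whitespace).
var anchored = /^\s*<[^>]+>/;

console.log(unanchored.test(jsWithMarkup));               // true (false positive)
console.log(anchored.test(jsWithMarkup));                 // false
console.log(anchored.test('<body><p>hi</p></body>'));     // true
```

Anchoring only helps when each source is a single language. HTML that begins with text, or JS that begins with a tag inside a template, would still be misjudged.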

Then again it really depends on the situation you are facing, and how the codes mix.