I have a program that logs every GET/POST request made by a website during the page load process. I want to go through these requests one by one, execute them, and then determine whether the file that was returned is JavaScript. Given that it won't necessarily have a .js ending (because of scripts like this, yanked from google.com a minute ago), how can I parse the file returned by the request and identify whether it is a JavaScript file?

Thanks!

EDIT: It is better to get a false positive than a false negative. That is, I would rather have some non-JS included in the JS-list than cut some real JS from the list.

K. Dackow
  • And what have you tried so far? – A l w a y s S u n n y Jul 30 '18 at 17:03
  • Check the content-type. – Madhawa Priyashantha Jul 30 '18 at 17:05
  • It's hard since javascript doesn't have a specific pattern inside of it. A file containing `'hey!';` can be considered javascript if you change the extension to `js`. Basically, it's a plain text file with a `js` extension. – Phiter Jul 30 '18 at 17:05
  • I have tried to find something akin to ` ` but it does not appear to be standardized. I have also considered just parsing all of the code as if it were JS and, when exceptions get thrown (e.g. the binary from an img would not be read properly), marking the files as not JS. This just seems a bit dangerous to me, as it could leave some non-JS code in the JS list, which I need to avoid. – K. Dackow Jul 30 '18 at 17:07
  • @FastSnail is content-type necessarily served for all GET/POST requests? – K. Dackow Jul 30 '18 at 17:08
  • If the server doesn't set the correct content-type, browsers won't execute the javascript code. So depending on your use case, content-type might solve your issue. – Bernard Jul 30 '18 at 17:12
  • RFC 2616 says responses [SHOULD include a content type header](https://tools.ietf.org/html/rfc2616#section-7.2.1), so you'll almost always (if not always) have one, yes. Depending on how paranoid you're being (e.g. if you're looking for script hidden in other content), you might not want to rely on it. – Rup Jul 30 '18 at 17:12
  • @Bernard Do you have a reference for that? I would love to read more about how browsers identify the JS to execute. – K. Dackow Jul 30 '18 at 17:32
  • @Rup I am concerned about that problem, actually, but do you have any examples of JS embedded within other files like this? I edited the post to reflect my biggest concern for this. – K. Dackow Jul 30 '18 at 17:33
  • @K.Dackow Sorry, it looks like I was totally wrong, and what happens is the exact opposite. See [this answer](https://stackoverflow.com/a/37863890/1021959). – Bernard Jul 31 '18 at 04:02

2 Answers

The JavaScript link that you referred to does not have a content type, nor does it have the js extension. Any text file can be considered JavaScript if it can be executed, which can make detection from scratch very difficult. Two methods come to mind.

  1. Run a linter on the file contents. If it reports a syntax error or a parsing error, the file is not JavaScript. If there is no syntax or parsing error, the file should be considered JavaScript.

  2. Parse the file contents into an AST (abstract syntax tree). A JavaScript file would parse without errors. There should be a number of AST libraries available. I haven't worked with JS ASTs, so I can't recommend any one of them, but a quick search should give you some options.

I am not sure, but a linter probably builds an AST before running its checks. If so, parsing to an AST alone seems like the lighter option.
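
As a minimal sketch of the parse-only idea, assuming the logged requests are replayed from Node.js: the built-in `vm` module compiles source without running it, so a failed compile can serve as the "not JavaScript" signal. The helper name `parsesAsScript` is hypothetical, and the check treats the input as a classic script, so ES modules that use `import`/`export` would be rejected and would need a module-aware parser instead.

```javascript
const vm = require("node:vm");

// Hypothetical helper: returns true when `source` compiles as a classic
// (non-module) script. new vm.Script() only compiles the code; nothing is
// executed unless one of the script's run* methods is called.
function parsesAsScript(source) {
  try {
    new vm.Script(source);
    return true;
  } catch (err) {
    // A failed compile surfaces as a SyntaxError; anything else is unexpected.
    if (err && err.name === "SyntaxError") return false;
    throw err;
  }
}

console.log(parsesAsScript("var x = 1;"));
console.log(parsesAsScript("<html><body>hi</body></html>"));
```

This errs toward false positives: a plain-text body such as `hello` is a valid expression statement and passes, which matches the preference stated in the question's edit.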

cnvzmxcvmcx
  • Example AST: https://github.com/benjamn/recast Learn more about AST here: https://stackoverflow.com/questions/16127985/what-is-javascript-ast-how-to-play-with-it – cnvzmxcvmcx Jul 30 '18 at 17:22
  • I am running some tests now to see if the response headers included by the google link have the content-type, but this seems like a good failsafe. I have a large corpus of site data I will test on to determine if content-type is effective. Thank you! – K. Dackow Jul 30 '18 at 17:29

The easiest way would be to check whether anything in the URI identifies the file as JavaScript, because the alternatives are a lot heavier. But since you said this isn't an option, you can always check the syntax of the contents of each file using some heuristic tool. You can also check the response headers for the content-type.
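
A rough sketch of the content-type check, assuming the raw `Content-Type` header value is available for each logged response. The helper name `isJavaScriptContentType` is hypothetical, and the set below is only a subset of the MIME types that browsers treat as JavaScript.

```javascript
// Subset of MIME types commonly used for JavaScript (not exhaustive).
const JS_MIME_TYPES = new Set([
  "text/javascript",
  "application/javascript",
  "application/x-javascript",
  "text/ecmascript",
  "application/ecmascript",
]);

// Hypothetical helper: true when the header's media type is in the set above.
// Parameters such as "; charset=utf-8" are ignored, and matching is
// case-insensitive.
function isJavaScriptContentType(headerValue) {
  if (typeof headerValue !== "string") return false;
  const mediaType = headerValue.split(";")[0].trim().toLowerCase();
  return JS_MIME_TYPES.has(mediaType);
}

console.log(isJavaScriptContentType("application/javascript; charset=utf-8"));
console.log(isJavaScriptContentType("image/png"));
```

Because a missing or wrong header does not prove that a response is not JavaScript, a `false` result here is probably best treated as "unknown" and followed by a syntax check of the body when false negatives are the bigger concern.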

Webber
  • What heuristic tools are you referring to? – K. Dackow Jul 30 '18 at 17:11
  • No specific tool. First I would determine whether you can tolerate false positives or false negatives. If so, you can simply check the file for valid JavaScript syntax, which might be as simple as running the script on node to see whether it returns an error code or not. – Webber Jul 30 '18 at 17:15
  • I cannot really have either, but a false negative (i.e. cutting a real JavaScript file) is likely worse than keeping a non-JS file. Thank you! – K. Dackow Jul 30 '18 at 17:28
  • In practice, you would be able to identify the linked URI as Javascript 3 times in the first 14 characters after the domain name `xjs/_/js/k=xjs`. – Webber Jul 30 '18 at 17:49
  • Yes, however not all domain names include JS in them. Additionally there is the risk of some bad site layout having JS in the URL when it’s not a JS file (Also I am checking for that!) – K. Dackow Jul 30 '18 at 17:51
  • Those files that have `js` in the URL but aren't actually JavaScript files: do they have a content-type other than JavaScript? Also, you might be able to differentiate URLs containing js just by filtering smartly (e.g. `/js/*` wouldn't contain anything other than js, whereas `badjs.css` shouldn't be matched as a JavaScript file). – Webber Jul 30 '18 at 17:58
  • Though this has potential, I will not do this, as that volume of special casing makes it more convoluted than just parsing the code using a linter. Thank you, though! – K. Dackow Jul 30 '18 at 18:00