3

I need a module or strategy for detecting whether a piece of data is written in a programming language. I don't mean syntax highlighting, where the user explicitly chooses which syntax to highlight. My question has two levels, and I would greatly appreciate any help:

  1. Is there a Python package that takes a string (a piece of data) and reports whether it matches the syntax of any programming language?
  2. I don't necessarily need to identify the language; I only need to know whether the string is source code at all.

Any clues are deeply appreciated.

TheCodeArtist
PepperoniPizza
  • What is the scope of your project? How many languages do you need it to detect? Are false positives or false negatives more important to minimize? If you don't care what kind of language you detect, see http://programmers.stackexchange.com/questions/87611/simple-method-for-reliably-detecting-code-in-text – Patashu May 07 '13 at 04:49
  • The project is medium-sized and will be used to filter harvested sources, so false negatives are not a worry, but false positives are important to avoid. As for languages, as many as possible, I guess. – PepperoniPizza May 07 '13 at 04:50
  • Dupe of http://stackoverflow.com/questions/475033/detecting-programming-language-from-a-snippet ? At the very least, the [linguist](https://github.com/github/linguist) looks like pretty much what you're looking for. (Or as close as you're likely to find.) – Lucas Wiman May 09 '13 at 05:09
  • This SO question probably has the answer you're looking for http://stackoverflow.com/questions/325165/is-there-a-library-that-will-detect-the-source-code-language-of-a-block-of-code – elssar May 09 '13 at 05:36
  • Does this answer your question? [Is there a library that will detect the source code language of a block of code?](https://stackoverflow.com/questions/325165/is-there-a-library-that-will-detect-the-source-code-language-of-a-block-of-code) – MatthewMartin Jan 17 '21 at 17:44

3 Answers

3

You could look at methods based on Bayesian filtering.
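
As a rough illustration of what Bayesian filtering could look like here, the following is a minimal naive Bayes sketch in plain Python. The training samples, the tokenizer and the names `TRAINING`, `tokenize`, `train` and `classify` are all made up for this example; a usable filter would need a far larger labeled corpus and a confidence threshold tuned to keep false positives low.

```python
import math
import re
from collections import Counter

# Tiny hypothetical training set, only to make the sketch runnable.
TRAINING = {
    "code": [
        "def add(a, b):\n    return a + b",
        "for (int i = 0; i < n; i++) { total += values[i]; }",
        "import os\nprint(os.getcwd())",
        "if (x == null) { return false; }",
    ],
    "text": [
        "The weather was lovely today, so we walked to the park.",
        "Please send me the report before the end of the week.",
        "I think this is a good idea, but we should discuss it first.",
        "She said that the meeting had been moved to Monday.",
    ],
}

# Words become one token each; every punctuation character is its own token.
TOKEN_RE = re.compile(r"[A-Za-z_]+|[^\sA-Za-z_0-9]")


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


def train(samples):
    """Count token occurrences per label."""
    counts = {label: Counter() for label in samples}
    for label, docs in samples.items():
        for doc in docs:
            counts[label].update(tokenize(doc))
    return counts


def classify(text, counts):
    """Return the label with the highest log-likelihood (equal priors assumed)."""
    vocab = set().union(*counts.values())
    scores = {}
    for label, token_counts in counts.items():
        total = sum(token_counts.values())
        # Laplace (add-one) smoothing so unseen tokens do not zero out a label.
        scores[label] = sum(
            math.log((token_counts[tok] + 1) / (total + len(vocab)))
            for tok in tokenize(text)
        )
    return max(scores, key=scores.get)


counts = train(TRAINING)
label = classify("while (i < 10) { i++; }", counts)
```

With so little training data the result is only indicative; the point is that punctuation such as `(`, `{` and `;` ends up carrying most of the evidence for the "code" label.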

kiriloff
3

Maybe you can use existing multi-language syntax highlighters. Many of them can detect the language a file is written in.

Jokester
  • Could you please post an example or a package that does this? All the ones I saw require you to specify the highlighting language. – PepperoniPizza May 09 '13 at 04:56
  • Depending on efficiency requirements, you could just loop through all the supported languages and see if any of them parse. – Lucas Wiman May 09 '13 at 05:07
  • @PepperoniPizza My apologies. I found that many packages actually detect the language by file extension. Anyway, I found [a js implementation](https://github.com/isagalaev/highlight.js/blob/master/src/highlight.js#L420) of code-language relevance. – Jokester May 09 '13 at 12:30
  • @jokester, cool, that's more or less what I was looking for; it's a shame it's not written in Python. – PepperoniPizza May 09 '13 at 15:40
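
The suggestion in the comments above, looping over the supported languages and checking whether any of them parses the input, might be sketched like this using only parsers that ship with Python. The `PARSERS` table and the function name `languages_that_parse` are assumptions for illustration; other languages would need their own parsers plugged in.

```python
import ast
import json

# Hypothetical registry: language name -> callable that raises on invalid input.
# Only parsers from the standard library are used here.
PARSERS = {
    "python": ast.parse,
    "json": json.loads,
}


def languages_that_parse(text):
    """Return the names of all registered languages whose parser accepts text."""
    matches = []
    for name, parse in PARSERS.items():
        try:
            parse(text)
        except (SyntaxError, ValueError):
            # ast.parse raises SyntaxError (or ValueError for null bytes);
            # json.loads raises JSONDecodeError, a ValueError subclass.
            continue
        matches.append(name)
    return matches


snippet = "def f(x):\n    return x + 1\n"
result = languages_that_parse(snippet)
```

Note that a successful parse is weak evidence on its own: a single word such as `hello` is a valid Python expression, and a JSON object is usually valid Python syntax too, so short inputs will produce false positives unless you also require a minimum length or some language-specific structure.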
2

My answer depends somewhat on the amount of code you're going to be given. If you're going to be given 30+ lines of code, it should be fairly easy to identify some features of each language that are both distinctive and fairly common. For example, tell the program that if anything matches an expression like from * import * then it's Python (I'm not 100% sure that phrasing is unique to Python, but you get the gist). Other things that usually differ slightly are class and function definitions (e.g. a Python class definition always starts with 'class', while a C function starts with its return type, so you could check for a line that begins with a data type and is shaped like a function declaration), the formatting of conditionals, and so on.

If you wanted to make it more accurate, you could introduce some sort of weighting system: features that are more distinctive and less likely to be the result of a mismatched regexp get a higher weight, features that are commonly mismatched get a lower weight, and at the end you calculate which language has the highest composite score. You could also define features that you feel are 100% unique and tell the program to stop parsing as soon as it hits one of those, because it knows the answer (things like the shebang line).

This would, of course, involve you knowing enough about the languages you want to identify to find unique features to look for, or being able to find people who do know unique structures that would help.
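
The weighted-feature idea described above could be sketched roughly as follows. Every pattern, weight and the `threshold` value here is an illustrative guess rather than a tuned set of rules, and the names `FEATURES`, `SHEBANG`, `score_languages` and `guess_language` are made up for the example.

```python
import re

# (pattern, weight) pairs per language. Distinctive features get a high
# weight; features that often match by accident get a low one.
FEATURES = {
    "python": [
        (re.compile(r"^\s*from\s+[\w.]+\s+import\s+\S+", re.M), 3.0),
        (re.compile(r"^\s*def\s+\w+\s*\(.*\)\s*:", re.M), 3.0),
        (re.compile(r"^\s*class\s+\w+.*:\s*$", re.M), 2.0),
    ],
    "c": [
        (re.compile(r"^\s*#include\s*[<\"]", re.M), 3.0),
        (re.compile(r"^\s*(?:int|void|char|float|double)\s+\w+\s*\([^)]*\)", re.M), 2.0),
        (re.compile(r";\s*$", re.M), 0.5),  # trailing semicolons: weak evidence
    ],
}

# A shebang on the first line is treated as decisive, so scoring is skipped.
SHEBANG = re.compile(r"^#!.*\b(python|perl|ruby|bash|sh|node)\d*(?:\.\d+)?\b")


def score_languages(text):
    """Sum weight * number of matches for every feature of every language."""
    return {
        lang: sum(weight * len(pattern.findall(text)) for pattern, weight in feats)
        for lang, feats in FEATURES.items()
    }


def guess_language(text, threshold=3.0):
    """Return the best-scoring language, or None if the evidence is too weak."""
    shebang = SHEBANG.match(text)
    if shebang:
        return shebang.group(1)
    scores = score_languages(text)
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None


guess = guess_language("from os import path\n\ndef main():\n    print(path.sep)\n")
```

Returning `None` below the threshold is what keeps false positives down: ordinary prose scores zero for every language, and raising `threshold` trades missed snippets for fewer wrong hits.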

If you're given fewer than 30 or so lines of code, the answers you get from pattern matching like that are going to be far less accurate. In that case, the easiest way to do it would probably be to take an appliance similar to Travis and just run the code in each language (in a VM, of course). If the code runs successfully in a language, you have your answer. If not, you would need a list of errors that are "acceptable" (as in, they are errors in the way the code was written, not in the interpreter). It's not a great solution, but at some point your code sample will just be too short to give an accurate answer.
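
A very rough sketch of the "run it and look at the errors" approach is shown below, for a single candidate language. It only runs the snippet with the local Python interpreter; the `CANDIDATES` table, the function name `runs_as` and the choice of which error names count as "not this language" are assumptions for illustration. As noted above, anything like this should only ever be run inside a VM or similar sandbox, because it executes untrusted code.

```python
import subprocess
import sys

# Hypothetical table: language -> (command prefix, stderr markers that mean
# the interpreter could not even parse the snippet). Other languages would
# need their own interpreter command and markers.
CANDIDATES = {
    "python": (
        [sys.executable, "-I", "-c"],
        ("SyntaxError", "IndentationError", "TabError"),
    ),
}


def runs_as(code, language, timeout=5):
    """Run code with the given interpreter and classify the outcome.

    WARNING: executes untrusted code; only call this inside a sandbox or VM.
    Returns "ran", "rejected", "acceptable_error" or "timeout".
    """
    command, rejecting_markers = CANDIDATES[language]
    try:
        result = subprocess.run(
            command + [code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    if result.returncode == 0:
        return "ran"
    if any(marker in result.stderr for marker in rejecting_markers):
        return "rejected"  # the interpreter did not accept the syntax
    return "acceptable_error"  # parsed fine but failed at run time


outcome = runs_as("print(1 + 1)", "python")
```

Here a `NameError` or a missing module counts as an "acceptable" error, since the snippet was still valid in that language, while a syntax-level error means the snippet was rejected.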

Seth Curry