11

Is it possible to detect programming-language source code (primarily Java and C#) in a text?

For example, I want to know whether this text contains any source code.

.. text text text text text text text text text
text text text text text text text text text
text text text text text text text text text

public static Person createInstance() { return new Person();}

text text text text text text text text text
text text text text text text text text text
text text text text text text text text text ..

I have been searching for this for a while and couldn't find anything.

A solution with Python would be wonderful.

Regards.

Kerem
  • How reliable do you want this to be (how many false positives or false negatives do you want to allow)? Do you really just want to know *if* there is some source code somewhere in your text, or do you want to locate and delineate it from the rest of the text? – Tim Pietzcker Jul 05 '10 at 12:32
  • I don't think there's a magic way to do it, as code is interleaved with "normal text", and it's probably almost impossible to be 100% right. (But never say never). – Andrei Ciobanu Jul 05 '10 at 12:34
  • Of course there will be false positives. It is impossible to avoid that. Yes, I just want to know if there is some source code somewhere in my text. I don't need to locate it. Knowing is enough for my case. – Kerem Jul 05 '10 at 12:46
  • Something got my attention. What does stackoverflow.com use for syntax highlighting? It properly detects and highlights the partial source code in the text I wrote as an example above. – Kerem Jul 05 '10 at 13:17
  • Some answers here may be useful: http://programmers.stackexchange.com/questions/87611/simple-method-for-reliably-detecting-code-in-text – NoBugs Jul 11 '11 at 23:17
  • A paper by Martin Robillard & Peter Rigby might be particularly relevant http://www.cs.mcgill.ca/~martin/papers/icse2013.pdf. It also focuses on all informal documentation, including StackOverflow posts! – HJM Jun 13 '13 at 23:41
  • A related question was discussed here: http://stackoverflow.com/questions/475033/detecting-programming-language-from-a-snippet. Difference: It was only about programming languages, not about programming questions buried in text. – james.garriss Oct 03 '14 at 11:34

2 Answers

3

There are some syntax highlighters around (pygments, google-code-prettify) and they've solved code detection and classification. Studying their sources could give an idea of how it is done.

(Now that I've looked at pygments again, I don't know whether it can autodetect the programming language. But google-code-prettify definitely can.)

Andreas Dolk
  • I've checked out pygments. It recognizes only full source code files. – Kerem Jul 05 '10 at 13:15
  • @Kerem - thought so, that's what it is designed for - but maybe you can iterate through the lines (or words) and use the pygments functions on every iteration (in other words, test every line to see whether it is the start of a source code fragment) – Andreas Dolk Jul 05 '10 at 13:41
  • Why should a code beautifier detect code? All they do is take code as input, apply a grammar and mark it up based on some template. Most of them don't even try to detect the programming language – pouya Dec 02 '20 at 15:05
0

You would need a database of keywords with their characteristics (definitions, control structures, etc.), as well as a list of the operators and special characters used throughout the language's structure (e.g. (, }, *, ||), and a list of regex patterns.

The best bet, to reduce iterations, would be to search on the keywords/operators/characters first. Using a spatial/frequency formula, start only at text that may be code, based on the value the formula returns. Then it's on to identifying which language it is and where it ends.

Because many languages have similar syntax, this might be hard. Which language is the following?
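As a rough illustration of the keyword/operator frequency idea, here is a minimal Python sketch. The keyword list, the set of "code characters", the weight and the threshold are all assumptions picked for the example and would need tuning on real data; `code_score`, `looks_like_code` and `contains_code` are hypothetical helper names, not part of any library.

```python
import re

# Hypothetical, deliberately small keyword list for Java/C#-like code.
KEYWORDS = {
    "public", "private", "protected", "static", "void", "class",
    "return", "new", "if", "else", "for", "while", "int", "string",
    "namespace", "using", "import",
}
# Characters that are common in code but rare in prose (an assumption).
CODE_CHARS = set("{}();=<>[]")
# Arbitrary weight so that a few symbols count about as much as keywords.
SYMBOL_WEIGHT = 5.0

WORD_RE = re.compile(r"[A-Za-z_]\w*")


def code_score(line):
    """Return a rough 'looks like code' score in [0, 1] for one line."""
    stripped = line.strip()
    if not stripped:
        return 0.0
    words = WORD_RE.findall(stripped)
    keyword_hits = sum(1 for w in words if w in KEYWORDS)
    symbol_hits = sum(1 for ch in stripped if ch in CODE_CHARS)
    keyword_ratio = keyword_hits / len(words) if words else 0.0
    symbol_ratio = symbol_hits / len(stripped)
    return min(1.0, keyword_ratio + SYMBOL_WEIGHT * symbol_ratio)


def looks_like_code(line, threshold=0.5):
    """True if the line's score reaches the (arbitrary) threshold."""
    return code_score(line) >= threshold


def contains_code(text, threshold=0.5):
    """True if any single line of the text looks like code."""
    return any(looks_like_code(line, threshold) for line in text.splitlines())


sample = (
    "text text text text text\n"
    "\n"
    "public static Person createInstance() { return new Person();}\n"
    "\n"
    "text text text text text\n"
)
print(contains_code(sample))
```

This only answers "is there something code-like somewhere", which is what the question asks for. It will produce false positives and false negatives, and it makes no attempt to say which language was found.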

for(i=0;i<10;i++){
   // for loop
} 

Without the comment, it could be any of many languages. With the comment, you could at least throw out Perl, since it uses # as the comment character, but it could still be JavaScript, C/C++, etc.

Basically, you will need to do a lot of recursive lookups to identify proper code, so if you want something quick you'll need a beast of a computer or a cluster of computers. Additionally, the search formula and the identification formula will need to be well refined for each language.

Identifying the language of code that has no library calls or includes may be impossible. The best you can do is list every language it could belong to, and for that you'll need a syntax library.
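The elimination step described above could be sketched in Python roughly like this. The table only covers a handful of languages and only their line-comment markers, and `candidate_languages` is a hypothetical helper written for this example; a real detector would need many more signals.

```python
# Line-comment marker per language (small, hand-picked subset).
LINE_COMMENT = {
    "C": "//",
    "C++": "//",
    "C#": "//",
    "Java": "//",
    "JavaScript": "//",
    "Perl": "#",
    "Python": "#",
}


def candidate_languages(snippet):
    """Return the languages whose line-comment marker fits the snippet.

    Only comments that start a line are considered, so a '#include'
    line would be misread as a '#' comment; this is a known limitation
    of the sketch.
    """
    markers = set(LINE_COMMENT.values())
    seen = set()
    for line in snippet.splitlines():
        stripped = line.strip()
        for marker in markers:
            if stripped.startswith(marker):
                seen.add(marker)
    if not seen:
        # No comment found: nothing can be ruled out.
        return sorted(LINE_COMMENT)
    return sorted(lang for lang, m in LINE_COMMENT.items() if m in seen)


with_comment = "for(i=0;i<10;i++){\n   // for loop\n}"
without_comment = "for(i=0;i<10;i++){\n}"

print(candidate_languages(with_comment))
print(candidate_languages(without_comment))
```

With the comment, Perl and Python drop out of the list; without it, every language in the table remains a candidate, which is exactly the ambiguity described above.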

vol7ron