
Are there any existing tools for detecting whether a piece of text is source code or natural language? The tool does not need to identify which programming language or which natural language is used. Ideally, though, it would work for any programming language and any natural language.

For example, this piece of text would be identified as source code:

def fib(n):
    a, b = 0, 1
    while a < n:
        print(a, end=' ')
        a, b = b, a+b
    print()

and this piece of text would be identified as natural language:

Hello! This is natural language.
Rhea Lin
  • Possible duplicate of [Detecting programming language from a snippet](https://stackoverflow.com/questions/475033/detecting-programming-language-from-a-snippet) – Malbordio Aug 14 '17 at 21:08
  • I expect this to work just as well as (if not even better than) standard language identification. If you have some training material (labeled data), you can train a binary classifier using character n-grams as features. – lenz Aug 14 '17 at 22:54
  • @Malbordio: No, the question you linked to is about detecting the programming language used in some source code (snippet). This question is about differentiating source code from natural language. – fnl Aug 15 '17 at 12:04
  • I agree that the easiest way to detect SC vs. NL would be along the lines of what @lenz suggested: counting characters, and possibly bi- and tri-grams, and comparing their distributions. Code will have lots of symbols, camelCase n-grams, etc. The (mean) line length, the number of empty lines, and the presence of indents should also be helpful features, though I'm only guesstimating "ad hoc" here. – fnl Aug 15 '17 at 12:09
  • @lenz Yep, that was how I intended to go about it if there wasn't already an existing tool I could use. Thanks for your input, everyone! – Rhea Lin Aug 15 '17 at 13:19
  • Did you implement a solution yourself? :) I'm looking for exactly the same thing. – Kaschi14 Aug 07 '23 at 17:30
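
Two of the surface features mentioned in the comments (symbol density and indentation) can be sketched as a hand-tuned heuristic. This is not the trained character n-gram classifier that lenz proposed, only a minimal illustration of the idea. The symbol set, both thresholds, and the names `surface_features` and `looks_like_code` are assumptions made for this example and do not come from any existing tool.

```python
# Symbols that are frequent in code but rare in prose (an assumed, hand-picked set).
CODE_SYMBOLS = set("{}[]()<>=+-*/%;:_#&|\\@^~")


def surface_features(text):
    """Compute two surface statistics: symbol density and indented-line share."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    non_space = [ch for ch in text if not ch.isspace()]
    symbol_ratio = (
        sum(ch in CODE_SYMBOLS for ch in non_space) / len(non_space)
        if non_space else 0.0
    )
    indented_ratio = (
        sum(ln[0] in " \t" for ln in lines) / len(lines) if lines else 0.0
    )
    return {"symbol_ratio": symbol_ratio, "indented_ratio": indented_ratio}


def looks_like_code(text, symbol_threshold=0.08, indent_threshold=0.3):
    """Guess 'code' when either statistic reaches its (assumed) threshold."""
    feats = surface_features(text)
    return (
        feats["symbol_ratio"] >= symbol_threshold
        or feats["indented_ratio"] >= indent_threshold
    )


SAMPLE_CODE = """def fib(n):
    a, b = 0, 1
    while a < n:
        print(a, end=' ')
        a, b = b, a+b
    print()
"""

print(looks_like_code(SAMPLE_CODE))                         # True
print(looks_like_code("Hello! This is natural language."))  # False
```

A heuristic like this will misfire on symbol-heavy prose or on unindented, symbol-light code, so with labeled data a classifier trained on character n-grams is likely the more robust route.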

0 Answers