
I have a few thousand sentences of varying length. They come in many forms, ranging from a 3-character reply to a 4000-character reply with a lot of code snippets. The code snippets can be in any language.

How do I recognise comments that are questions (interrogative) and do not contain code snippets? The comments need not be in question form or follow a strict structure.

The app is built on Ruby on Rails 3.

Some example sentences:

1: How to solve segmentation fault? #valid
2: You'll have to use the BigInteger #invalid
3: some tips to remove runtime error #invalid
4: :disappointed: :disappointed: Okay #invalid (contains smilies)
5: In which category this problem fall? Graph Theory? #valid

shikhar.ja
  • I think you need to include some examples of each of the different things you want to distinguish between. – Max Williams Feb 17 '15 at 14:26
  • This might be relevant http://stackoverflow.com/questions/475033/detecting-programming-language-from-a-snippet – Max Williams Feb 17 '15 at 14:27
  • @MaxWilliams have added a few example statements (excluding ones with code snippets.). Main problem is to figure out if they are interrogative or not – shikhar.ja Feb 17 '15 at 14:41
  • Perhaps it is overkill (I can't determine from the question), but it might be worth looking into NLP gems to detect the intent of a sentence (if it's a question, etc) – kristof Feb 17 '15 at 14:52
  • based on your sample questions, just look for the presence of a '?' – Yule Feb 17 '15 at 15:01
  • Welcome to Stack Overflow. Your question is much too broad and would require a small book to describe how to do what you want in anything but a broad manner, which is off-topic for Stack Overflow. Instead, you need to try writing code, and when you run into problems, ask specific questions. "There are either too many possible answers, or good answers would be too long for this format. Please add details to narrow the answer set or to isolate an issue that can be answered in a few paragraphs." – the Tin Man Feb 17 '15 at 17:07

2 Answers

This is an example of a text classification problem, which is generally solved by generating some features and applying a machine-learning classification algorithm to them.

For your particular case, question detection is a well-studied area. One of the simplest possible approaches is a heuristic one using regular expressions.

The following solution is taken from this paper:

A sentence is detected as a question if it fulfills any of the following:
  • It ends with a question mark, and is not a URL.
  • It contains a phrase that begins with words that fit an interrogative question pattern. This is a generalization of 5W-1H question words. For example, the second phrase of “When you are free, can you give me a call” is a strong indicator that the sentence is a question.
  • It fits the pattern of common questions that are not in the interrogative form. For instance, “Let me know when you will be free” is one such question.
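As a rough illustration, the three rules could be sketched in plain Ruby like this. The word lists, the constant names and the `question?` helper are my own assumptions for the sake of the example, not something taken from the paper, so treat the patterns as a starting point to tune on your own data.

```ruby
# Heuristic question detector. All patterns below are illustrative
# assumptions; extend or prune them for your own comments.

# A text that is nothing but a URL (used to veto rule 1).
URL_ONLY = %r{\Ahttps?://\S+\z}i

# Rule 2: a phrase (start of text, or after , ; . !) that opens with a
# 5W-1H word, or with an auxiliary verb followed by a pronoun.
INTERROGATIVE = /
  (?:\A|[,;.!]\s*)
  (?:
    (?:who|whom|whose|what|when|where|why|which|how)\b
    |
    (?:can|could|would|will|should|do|does|did|is|are|was|were|has|have)\s+
    (?:you|i|we|it|they|he|she|this|that|there|anyone|someone)\b
  )
/ix

# Rule 3: common requests that are questions without interrogative form.
IMPLICIT = /\b(?:let me know|tell me|i wonder|i would like to know|any idea)\b/i

def question?(text)
  sentence = text.to_s.strip
  return false if sentence.empty?
  # Rule 1: ends with a question mark and is not just a URL.
  return true if sentence.end_with?("?") && sentence !~ URL_ONLY
  # Rules 2 and 3.
  !(sentence =~ INTERROGATIVE).nil? || !(sentence =~ IMPLICIT).nil?
end

puts question?("How to solve segmentation fault?")   # true
puts question?("You'll have to use the BigInteger")  # false
puts question?("Let me know when you will be free")  # true
```

Rule 2 is deliberately loose, so a statement such as "When I run it, it crashes" is also flagged; expect some false positives until the patterns are tightened.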

More complex solutions are also described; you can find them in the paper mentioned above or by googling "question detection algorithm".

For code snippet detection, there are existing solutions that detect the programming language, as mentioned in the comments. One example is http://www.rubyinside.com/sourceclassifier-identifying-programming-languages-quickly-1431.html

They can probably be adapted to detect whether a specific sample is code or not. Alternatively, you can train a simple Naive Bayes classifier using one of the existing libraries.
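To give an idea of what the Naive Bayes route involves, here is a minimal self-contained sketch of a multinomial Naive Bayes classifier with add-one smoothing. The `TinyNaiveBayes` class, its method names and the six training samples are hypothetical and only for illustration; in practice you would use an existing gem and train on a few hundred labelled comments of your own.

```ruby
# Minimal multinomial Naive Bayes with add-one (Laplace) smoothing.
# Class and method names are hypothetical, not taken from any gem.
class TinyNaiveBayes
  def initialize
    @token_counts = Hash.new { |h, k| h[k] = Hash.new(0) } # label => token => count
    @doc_counts   = Hash.new(0)                            # label => documents seen
    @vocabulary   = {}                                     # every token seen in training
  end

  # Words, numbers and single punctuation symbols. Symbols such as
  # braces and semicolons are strong hints that a text is code.
  def tokenize(text)
    text.downcase.scan(/[a-z_]+|\d+|[^\sa-z_\d]/)
  end

  def train(label, text)
    @doc_counts[label] += 1
    tokenize(text).each do |token|
      @token_counts[label][token] += 1
      @vocabulary[token] = true
    end
  end

  # Returns the label with the highest log-probability, or nil if untrained.
  def classify(text)
    return nil if @doc_counts.empty?
    total_docs = @doc_counts.values.inject(0) { |sum, n| sum + n }.to_f
    tokens = tokenize(text)
    scores = @doc_counts.keys.map do |label|
      label_total = @token_counts[label].values.inject(0) { |sum, n| sum + n }
      denominator = (label_total + @vocabulary.size).to_f
      log_prob = Math.log(@doc_counts[label] / total_docs)
      tokens.each do |token|
        log_prob += Math.log((@token_counts[label][token] + 1) / denominator)
      end
      [label, log_prob]
    end
    scores.max_by { |_, score| score }.first
  end
end

classifier = TinyNaiveBayes.new
classifier.train(:code, "for (int i = 0; i < n; i++) { sum += a[i]; }")
classifier.train(:code, "def solve(n): return n * (n + 1) / 2")
classifier.train(:code, 'int main() { printf("%d", x); return 0; }')
classifier.train(:text, "How to solve segmentation fault?")
classifier.train(:text, "You'll have to use the BigInteger")
classifier.train(:text, "some tips to remove runtime error")

p classifier.classify("while (x > 0) { x = x - 1; }")       # => :code
p classifier.classify("any tips to fix this runtime error") # => :text
```

Tokenizing punctuation as separate tokens is the main design choice here: it lets the classifier pick up on brackets, operators and semicolons without knowing anything about a particular programming language.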

Denis Tarasov

Text classification is one way of doing it, but for that you would need a good amount of sample data to train your model, to be able to detect your patterns accurately.

You can also parse these sentences to get parts of speech (POS) and then look for words like who, which, how, when, etc. to detect questions.

There is a Ruby library for Stanford NLP that provides a POS tagger you can use:

https://github.com/tiendung/ruby-nlp

skgemini