NLP to classify/label the content of a sentence (Ruby binding necessary)

Question

I am analysing a few million emails. My aim is to classify them into groups. Groups could be, for example:

Delivery problems (slow delivery, slow handling before dispatch, incorrect availability information, etc.)
Customer service problems (slow email response time, impolite response, etc.)
Return issues (slow handling of return request, lack of helpfulness from the customer service, etc.)
Pricing complaint (hidden fees discovered, etc.)

To perform this classification, I need an NLP tool that can recognize combinations of word groups like:

"[they|the company|the firm|the website|the merchant]"
"[did not|didn't|no]"
"[response|respond|answer|reply]"
"[before the next day|fast enough|at all]"
etc.

A few of these exemplified groups in combination should then match sentences like:

"They didn't respond"
"They didn't respond at all"
"There was no response at all"
"I received no response from the website"

And then classify the sentence as Customer service problems.
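
As a baseline for the matching described above, the word groups can be combined into a single regular expression. This is only a minimal sketch using plain `java.util.regex`, not a full NLP solution. The class name `WordGroupMatcher` and the choice to allow up to two words between the negation and the response word are assumptions made for illustration.

```java
import java.util.regex.Pattern;

public class WordGroupMatcher {
    // Word groups copied from the question.
    private static final String NEGATION = "(?:did not|didn't|no)";
    private static final String RESPONSE = "(?:response|respond|answer|reply)";

    // A negation followed, within at most two intervening words (an assumption),
    // by one of the response words.
    private static final Pattern NO_RESPONSE = Pattern.compile(
            "\\b" + NEGATION + "\\b(?:\\s+\\w+){0,2}?\\s+" + RESPONSE + "\\b",
            Pattern.CASE_INSENSITIVE);

    /** Returns the label for a sentence, or "unclassified" if the rule does not fire. */
    public static String classify(String sentence) {
        return NO_RESPONSE.matcher(sentence).find()
                ? "Customer service problems"
                : "unclassified";
    }

    public static void main(String[] args) {
        String[] sentences = {
            "They didn't respond",
            "They didn't respond at all",
            "There was no response at all",
            "I received no response from the website",
            "The parcel arrived on time"
        };
        for (String s : sentences) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```

A hand-written rule like this is easy to audit but has to be extended for every new phrasing, which is the main reason to consider a trained classifier instead.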

Which NLP toolkit would be able to handle such a task? From what I have read, these are the most relevant:

Stanford CoreNLP
OpenNLP

See also these suggested NLP tools.

Are you sure you need to use NLP? This looks like a classification problem to me. Have you tried simply extracting keywords from the email and training a classifier like naive-bayes? It might provide competitive results. — Harry, Jan 14 '14 at 14:07
That could be a solution, but I fear I will be setting the bar too low. If the test is too simple, I'll catch a lot of irrelevant sentences. I also thought of the sad_panda gem, but again it seems too simple for my challenge. — Cjoerg, Jan 14 '14 at 14:29

score 3 · Answer 1 · answered Jan 14 '14 at 17:42

Using the OpenNLP doccat (document categorizer) API, you can create training data and then build a model from it. The advantage of this over something like a naive Bayes classifier is that it returns a probability distribution over your set of categories.

Start by creating a training file in this format, with the category name first and the sample text after it:

customerserviceproblems They did not respond
customerserviceproblems They didn't respond 
customerserviceproblems They didn't respond at all
customerserviceproblems They did not respond at all
customerserviceproblems I received no response from the website
customerserviceproblems I did not receive response from the website

And so on. Provide as many samples as possible, and make sure each line ends with a newline (\n).

Using this approach, you can add any phrasing that means "customer service problems", and you can add other categories as well, so you don't have to be too deterministic about which data falls into which category.

Here is what the Java looks like to build the model:
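
If the samples come out of a database or a spreadsheet, it may help to generate the training lines programmatically so the format rules are enforced in one place. The following is a minimal sketch using only the Java standard library. `TrainingLineFormatter` and `toTrainingLine` are hypothetical names, not part of OpenNLP.

```java
public class TrainingLineFormatter {
    /** Formats one sample as "<category> <text>" followed by a newline. */
    public static String toTrainingLine(String category, String sample) {
        // The category must be a single token, otherwise part of it
        // would be read as sample text.
        boolean hasWhitespace = category.chars().anyMatch(Character::isWhitespace);
        if (category.isEmpty() || hasWhitespace) {
            throw new IllegalArgumentException(
                    "Category must be non-empty and contain no whitespace: " + category);
        }
        // Collapse line breaks and repeated whitespace so the sample stays on one line.
        String oneLine = sample.trim().replaceAll("\\s+", " ");
        return category + " " + oneLine + "\n";
    }

    public static void main(String[] args) {
        StringBuilder file = new StringBuilder();
        file.append(toTrainingLine("customerserviceproblems", "They did not respond"));
        file.append(toTrainingLine("customerserviceproblems",
                "I received no response\nfrom the website"));
        System.out.print(file);
    }
}
```

The resulting string can be written to the training file as-is.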

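import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

DoccatModel model = null;
// try-with-resources closes the training file even if training fails
try (InputStream dataIn = new FileInputStream(yourFileOfSamplesLikeAbove)) {
  ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
  ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
  model = DocumentCategorizerME.train("en", sampleStream);

  // Write the trained model to disk so it can be loaded later
  try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream(modelOutFile))) {
    model.serialize(modelOut);
  }
  System.out.println("Model complete!");
} catch (IOException e) {
  // Failed to read or parse training data, training failed
  e.printStackTrace();
}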

Once you have the model, you can then use it something like this:

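import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;

// The class name is illustrative; put these members wherever they fit in your code.
public class EmailCategorizer {
  private final DocumentCategorizerME documentCategorizerME;

  public EmailCategorizer(File pathToModelYouJustMade) throws IOException {
    // Load the serialized model written by the training step
    DoccatModel doccatModel = new DoccatModel(pathToModelYouJustMade);
    documentCategorizerME = new DocumentCategorizerME(doccatModel);
  }

  /**
   * Returns a map of each category to its score for the given text.
   *
   * @param text the sentence or email body to classify
   * @return category name mapped to the model's score for that category
   */
  public Map<String, Double> getScore(String text) {
    Map<String, Double> scoreMap = new HashMap<>();
    // categorize() returns one score per category, in the same order as getCategory(i)
    double[] categorize = documentCategorizerME.categorize(text);
    int catSize = documentCategorizerME.getNumberOfCategories();
    for (int i = 0; i < catSize; i++) {
      scoreMap.put(documentCategorizerME.getCategory(i), categorize[i]);
    }
    return scoreMap;
  }
}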

Then, in the returned map, you have each category that you modeled and its score. You can use the scores to decide which category the input text belongs to.
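
One simple way to turn the score map into a decision is to take the highest-scoring category and fall back to "unclassified" when nothing is confident enough. The sketch below uses only the Java standard library. `BestCategory`, `pick`, the 0.5 cut-off, and the example scores are all assumptions chosen for illustration, so tune the threshold against your own data.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class BestCategory {
    /** Returns the highest-scoring category, or empty if no score reaches minScore. */
    public static Optional<String> pick(Map<String, Double> scoreMap, double minScore) {
        return scoreMap.entrySet().stream()
                .filter(e -> e.getValue() >= minScore)
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }

    public static void main(String[] args) {
        // Hypothetical scores, standing in for the map returned by getScore(text).
        Map<String, Double> scores = new HashMap<>();
        scores.put("customerserviceproblems", 0.72);
        scores.put("deliveryproblems", 0.18);
        scores.put("pricingcomplaint", 0.10);

        System.out.println(pick(scores, 0.5).orElse("unclassified"));
    }
}
```

With a threshold in place, sentences that do not clearly belong to any category can be routed to manual review instead of being forced into the nearest group.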

Thanks for this very thorough answer. This topic is very new to me, but I will sit down and try to implement this. I have found an article that can explain the general doccat concept to me (http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.doccat.classifying), and then I will try to implement your suggestion in my Ruby app with the Treat gem (https://github.com/louismullie/treat). I will then mark the question as answered. — Cjoerg, Jan 15 '14 at 08:43
Make sure your training file is correctly formatted. The category name cannot contain spaces, there is only a single space between the category name and the sample, and each sample ends with a newline (make sure the samples themselves do not contain newlines). You want as many samples as possible. — Mark Giaconia, Jan 15 '14 at 12:06
Thanks, great tips. I am currently trying to figure out how to convert the described Java logic into Ruby. Seems easier than to learn a new language. — Cjoerg, Jan 15 '14 at 12:12

score 2 · Answer 2 · edited May 23 '17 at 11:52

Not entirely sure, but I can think of two ways of trying to solve your problem:

Standard Machine Learning

As stated in the comment, extract only keywords from each mail and train a classifier using them. Define your relevant keyword set beforehand and extract only those keywords from the email if they are present.

This is a simple but powerful technique and not to be underestimated as it yields very good results in many cases. You might want to try this one out first as more complex algorithms might be overkill.
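
The keyword-extraction step could look roughly like this. It is a sketch under stated assumptions: the keyword list is invented for illustration, `KeywordFeatures` is a hypothetical name, and the output is just a count per keyword that you would then feed to whichever classifier you choose.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class KeywordFeatures {
    // Hypothetical keyword set; define your own from the categories you care about.
    private static final List<String> KEYWORDS =
            Arrays.asList("response", "reply", "delivery", "late", "refund", "fee");

    /** Returns one count per keyword, in the same order as KEYWORDS. */
    public static int[] extract(String email) {
        int[] counts = new int[KEYWORDS.size()];
        // Lowercase, then split on anything that is not a letter or an apostrophe.
        for (String token : email.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            int idx = KEYWORDS.indexOf(token);
            if (idx >= 0) {
                counts[idx]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] features = extract("No reply, no response, and the delivery was late.");
        System.out.println(Arrays.toString(features));
    }
}
```

Everything that is not in the keyword set is ignored, which keeps the feature vector small and makes the classifier's behaviour easier to inspect.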

Grammars

If you really want to delve into NLP, based on your question description, you might try defining some sort of grammar and parsing the email against it. I don't have much experience in Ruby, but I'm sure some lex/yacc-equivalent tools exist; a quick web search turns up a couple of Stack Overflow questions on the subject. By identifying the phrases that the grammar matches, you could judge which category an email falls under by calculating the proportion of phrases found for each category.

For example, intuitively, some productions within the grammar could be defined as:
```
{organization}{negative}{verb} :- delivery problems
```
where organization = [they|the company|the firm|the website|the merchant], etc.
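
To make the idea concrete, here is a rough sketch that stands in for a real parser by compiling each production into a regular expression and then computing the proportion of matched phrases per category. It does not use a lex/yacc-style tool. The verb lists, the second production, and the name `PhraseGrammar` are assumptions for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PhraseGrammar {
    private static final String ORGANIZATION =
            "(?:they|the company|the firm|the website|the merchant)";
    private static final String NEGATIVE = "(?:did not|didn't)";

    // category -> compiled production {organization}{negative}{verb}
    private static final Map<String, Pattern> PRODUCTIONS = new LinkedHashMap<>();
    static {
        // The verb groups below are assumed, not taken from the answer.
        PRODUCTIONS.put("delivery problems", production("(?:deliver|ship|dispatch)"));
        PRODUCTIONS.put("customer service problems", production("(?:respond|answer|reply)"));
    }

    private static Pattern production(String verbGroup) {
        return Pattern.compile(
                "\\b" + ORGANIZATION + "\\s+" + NEGATIVE + "\\s+" + verbGroup + "\\b",
                Pattern.CASE_INSENSITIVE);
    }

    /** Returns, per category, the share of all matched phrases that belong to it. */
    public static Map<String, Double> proportions(String email) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        int total = 0;
        for (Map.Entry<String, Pattern> p : PRODUCTIONS.entrySet()) {
            Matcher m = p.getValue().matcher(email);
            int n = 0;
            while (m.find()) {
                n++;
            }
            counts.put(p.getKey(), n);
            total += n;
        }
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> c : counts.entrySet()) {
            // With no matches at all, every category gets 0.0.
            result.put(c.getKey(), total == 0 ? 0.0 : (double) c.getValue() / total);
        }
        return result;
    }

    public static void main(String[] args) {
        String email = "They didn't respond. The company did not ship my order "
                + "and they did not deliver it.";
        System.out.println(proportions(email));
    }
}
```

A real grammar would also need to cover constructions such as "There was no response", which do not start with an organization; that is where a proper parser starts to pay off over regular expressions.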

These approaches might be a way to start.
