
I have a data set of 1M+ observations of customer interactions with a call center. The text is free text written by the representative taking the call. The text is not well formatted, nor is it close to being grammatically correct (a lot of shorthand). None of the free text is labeled, as I do not know what labels to provide.

Given the size of the data, would a random sample (large enough to give a high level of confidence) be a reasonable first step in determining what labels to create? Is it possible to avoid manually labeling 400+ random observations, or is there no other way to pre-process the data in order to determine a good set of labels to use for classification?

Appreciate any help on the issue.

meb33

2 Answers


Manual annotation is a good option when you have a clear idea of what an ideal document for each label looks like.

However, given the large dataset size, I would recommend that you fit an LDA (Latent Dirichlet Allocation) model to the documents and look at the topics generated; this will give you a good idea of labels that you can use for text classification.
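
For instance, here is a minimal sketch with scikit-learn, assuming your raw notes are in a list of strings called `notes` (a placeholder name) and trying 20 topics as an arbitrary starting point:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# `notes` is a placeholder for your list of 1M+ free-text call notes
vectorizer = CountVectorizer(max_df=0.95, min_df=5, stop_words="english")
dtm = vectorizer.fit_transform(notes)  # document-term matrix

# 20 topics is just a starting guess; tune this number
lda = LatentDirichletAllocation(n_components=20, random_state=0)
lda.fit(dtm)

# Inspect the top words of each topic to brainstorm candidate labels
terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_words = [terms[i] for i in topic.argsort()[-10:][::-1]]
    print(f"Topic {idx}: {', '.join(top_words)}")
```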

You can eventually also use LDA for text classification: pick a representative document for each of your labels, then find the closest documents to it under a similarity metric (say, cosine).
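
A sketch of that idea, reusing `dtm`, `lda`, and `notes` from the snippet above; the exemplar indices here are made up for illustration:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

doc_topics = lda.transform(dtm)  # per-document topic distributions

# Hypothetical: indices of notes you judged representative of each label
exemplars = {"billing question": 17, "wants refund": 342}

for label, idx in exemplars.items():
    sims = cosine_similarity(doc_topics[idx:idx + 1], doc_topics).ravel()
    closest = np.argsort(sims)[::-1][1:6]  # top 5, skipping the exemplar itself
    print(label, closest)
```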

Alternatively, once you have an idea of the labels, you can assign them without any manual intervention using LDA, but then you are restricted to unsupervised learning.

Hope this helps!

P.S. - Be sure to remove all the stopwords and use a stemmer to club together words of the same kind (e.g. managing, manage, management) at the pre-processing stage.
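
A small sketch of that pre-processing with NLTK; the stemmer reduces "managing", "manage", and "management" to the common stem "manag":

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("stopwords")
nltk.download("punkt")

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text):
    tokens = word_tokenize(text.lower())
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop_words]

print(preprocess("Customer is managing a refund, management will manage it"))
# -> ['custom', 'manag', 'refund', 'manag', 'manag']
```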

vendaTrout
  • Wow. That is... powerful. Thank you so much. If anyone is interested in seeing this implemented in R, see here: http://stackoverflow.com/questions/14875493/lda-with-topicmodels-how-can-i-see-which-topics-different-documents-belong-to – meb33 Feb 07 '17 at 17:55

Text Pre-Processing: Convert all text to lower case, tokenize into unigrams, remove all stop words, and use a stemmer to normalize each token to its base word.

There are two approaches I can think of for classifying the documents, i.e. the free text you spoke about (each free-text note is a document):

1) Supervised classification: Take some time and randomly pick a few sample documents and assign each one a category. Do this until you have multiple documents per category and all the categories you want to predict are covered.

Next, create a Tf-Idf matrix from this text. Select the top K features (tune the value of K to get the best results). Alternatively, you can use SVD to reduce the number of features by combining correlated features into one. Please bear in mind that you can also use other features, like the department of the customer service executive, as predictors. Now train a machine learning model and test it out.
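
A sketch of this supervised route with scikit-learn, assuming `texts` and `labels` are placeholder names for your hand-labelled sample:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# `texts` and `labels` are placeholders for your hand-labelled sample
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0)

model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),  # Tf-Idf matrix
    SelectKBest(chi2, k=1000),  # keep top K features; tune K
    # TruncatedSVD(n_components=300) would be the SVD alternative
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out labelled notes
```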

2) Unsupervised learning: If you know how many categories your output variable has, you can use that number as the number of clusters you want to create. Use the Tf-Idf vectors from the technique above and create k clusters. Randomly pick a few documents from each cluster and decide which category the documents belong to. Suppose you picked 5 documents and noticed that they all belong to the category "Wanting Refund"; label all documents in this cluster "Wanting Refund". Do this for all the remaining clusters.
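
A sketch of this unsupervised route with k-means, again assuming the notes live in a placeholder list called `notes` and that 10 categories is your guess:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

k = 10  # assumed number of categories; set this to what you expect
tfidf = TfidfVectorizer(stop_words="english").fit_transform(notes)
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(tfidf)

# Sample a few documents per cluster, read them, and name the cluster
rng = np.random.default_rng(0)
for c in range(k):
    members = np.where(km.labels_ == c)[0]
    for i in rng.choice(members, size=min(5, len(members)), replace=False):
        print(c, notes[i][:80])  # e.g. decide this cluster is "Wanting Refund"
```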

The advantage of unsupervised learning is that it saves you the pain of pre-classification and data preparation, but be aware that its accuracy might not be as good as supervised learning's.

The two methods explained above are an abstract overview of what can be done. Now that you have an idea, read up more on the topics and use a tool like RapidMiner to get the task done much faster.

Arjun