
My objective is to automatically route feedback emails to the appropriate division.
My fields are FNUMBER, CATEGORY, SUBCATEGORY and Description.
I have the last 6 months of data in this format, where the entire email is stored in Description along with its CATEGORY and SUBCATEGORY.

I have to analyse the Description column and find the keywords for each category/subcategory, so that when the next feedback email arrives it is automatically categorized into a category and subcategory based on the keywords generated from the historical data.

I have imported an XML file into R for text categorization and converted it into a data frame with the required fields. I have 23,017 records for a particular month; I have listed only the first twenty rows as a data frame below.

I have more than 100 categories and subcategories.
I am new to the text mining concept; however, with the help of SO and the tm package, I have tried the code below:

step1 <-  structure(list(FNUMBER = structure(1:20, .Label = c(" 20131202-0885 ", 
"20131202-0886 ", "20131202-0985 ", "20131202-1145 ", "20131202-1227 ", 
"20131202-1228 ", "20131202-1235 ", "20131202-1236 ", "20131202-1247 ", 
"20131202-1248 ", "20131202-1249 ", "20131222-0157 ", "20131230-0668 ", 
"20131230-0706 ", "20131230-0776 ", "20131230-0863 ", "20131230-0865 ", 
"20131230-0866 ", "20131230-0868 ", "20131230-0874 "), class = "factor"), 
    CATEGORY = structure(c(9L, 14L, 11L, 6L, 10L, 12L, 7L, 11L, 
    13L, 13L, 6L, 1L, 2L, 5L, 4L, 8L, 8L, 3L, 11L, 11L), .Label = c(" BVL-Vocational Licence (VL) Investigation ", 
    " BVL - Bus Licensing ", " Corporate Transformation Office (CTO) ", 
    " CSV - Customer Service ", " Deregistration - Transfer/Split/Encash Rebates ", 
    " ENF - Enforcement Matters ", " ENF - Illegal Parking  ", 
    " Marina Coastal Expressway ", " PTQ - Public Transport Quality ", 
    " Road Asset Management ", " Service Quality (SQ) ", " Traffic Management & Cycling ", 
    " VR - Issuance/disputes of bookings by vendors ", " VRLSO - Update Owner's Particulars "
    ), class = "factor"), SUBCATEGORY = structure(c(2L, 15L, 
    5L, 1L, 3L, 14L, 6L, 12L, 8L, 8L, 18L, 17L, 11L, 10L, 16L, 
    7L, 9L, 4L, 13L, 12L), .Label = c(" Abandoned Vehicles ", 
    " Bus driver behaviour ", " Claims for accident ", " Corporate Development ", 
    " FAQ ", " Illegal Parking ", " Intra Group (Straddling Case) ", 
    " Issuance/disputes of bookings by vendors ", " MCE ", " PARF (Transfer/Split/Encash) ", 
    " Private bus related matters ", " Referrals ", " Straddle Cases (Across Groups) ", 
    " Traffic Flow ", " Update Owner Particulars ", " Vehicle Related Matters ", 
    " VL Holders (Complaint/Investigation/Appeal) ", " Warrant of Arrrest "
    ), class = "factor"), Description = structure(c(3L, 1L, 2L, 
    9L, 4L, 7L, 8L, 6L, 5L, 3L, 1L, 2L, 9L, 4L, 7L, 8L, 6L, 5L, 
    7L, 8L), .Label = c(" The street is the ONLY road leading to &amp; exit for vehicles and buses to (I think) four temples and, with the latest addition of 8B, four (!!) industrial estate.", 
    "Could you kindly increase the frequencies for Service 58. All my colleagues who travelled AVOID 58!!!\nThey would rather take 62-87 instead of 3-58", 
    "I saw bus no. 169A approaching the bus stop. At that time, the passengers had already boarded and alighted from the bus.", 
    "I want to apologise and excuse about my summon because I dont know can&apos;t park my motorcycle at the double line when I friday prayer ..please forgive me", 
    "Many thanks for the prompt action. However please note that the rectification could rather short term as it&apos;s just replacing the bulb but without the proper cover to protect against the elements.PS. the same job was done i.e. without installing a cover a few months back; and the same problem happen again.", 
    "Placed in such a manner than it cannot be seen properly due to the background ahead; colours blend.There is not much room angle to divert from 1st lane to 2nd lane. The outer most cone covers more than 1st lane", 
    "The vehicle GX3368K was observed to be driving along PIE towards Changi on 28th November 2013, 3:48pm without functioning braking lights during the day.", 
    "The vehicle was behaving suspiciously with many sudden brakes - which caused vehicles behind to do heavy &quot;jam brakes&quot; due to no warnings at all (no brake lights).", 
    "We have received a feedback regarding the back lane of the said address being blocked up by items.\nKindly investigate and keep us in the loop on the actions taken while we look into any fire safety issues on this case again."
    ), class = "factor")), .Names = c("FNUMBER", "CATEGORY", 
"SUBCATEGORY", "Description"), class = "data.frame", row.names = c(NA, 
-20L))  

dim(step1)
names(step1)
library(tm)
m <- list(ID = "FNUMBER", Content = "Description")
myReader <- readTabular(mapping = m)
txt <- Corpus(DataframeSource(step1), readerControl = list(reader = myReader))

summary(txt)
txt <- tm_map(txt, content_transformer(tolower))  # content_transformer() is needed for base functions in newer tm versions
txt <- tm_map(txt,removeNumbers)
txt <- tm_map(txt,removePunctuation)
txt <- tm_map(txt,stripWhitespace)
txt <- tm_map(txt,removeWords,stopwords("english"))
txt <- tm_map(txt,stemDocument)


tdm <- TermDocumentMatrix(txt,
                      control = list(removePunctuation = TRUE,
                                     stopwords = TRUE))
tdm
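To sanity-check the matrix before going further, you can inspect the most frequent terms. A quick sketch using tm's built-in helpers:

```r
# Terms that appear at least 3 times across the whole corpus
findFreqTerms(tdm, lowfreq = 3)

# Overall term counts, sorted from most to least frequent
sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
```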

UPDATE: I have now got the frequently occurring keywords on the whole dataset:

tdm3 <- removeSparseTerms(tdm, 0.98)
library(reshape2)  # for melt()
TDM.dense <- melt(as.matrix(tdm3), value.name = "count")
TDM_Final <- aggregate(count ~ Terms, data = TDM.dense, FUN = sum)
colnames(TDM_Final) <- c("Words", "Word_Freq")
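To get keywords per category rather than over the whole dataset, one hedged sketch is to split the document columns of the TDM by CATEGORY and sum the term counts within each group (this assumes the column order of `tdm` matches the row order of `step1`, which holds when the corpus is built directly from the data frame):

```r
# Sketch: top keyword stems per CATEGORY.
# Assumes tdm's document columns are in the same order as the rows of step1.
freq <- as.matrix(tdm)
top_by_cat <- lapply(split(seq_len(nrow(step1)), step1$CATEGORY), function(idx) {
  counts <- rowSums(freq[, idx, drop = FALSE])
  head(sort(counts, decreasing = TRUE), 10)  # top 10 stems for this category
})

# e.g. inspect one category (note the levels carry leading/trailing spaces)
top_by_cat[[" PTQ - Public Transport Quality "]]
```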

I am stuck after this. I am not sure how to get:

1. The relevant keywords (unigrams, bigrams and trigrams) for each category/subcategory, thereby generating a taxonomy list (keywords mapped to category/subcategory).
2. When the next feedback email arrives, how to categorize it into categories and subcategories (there are 100+ categories) based on the keyword taxonomy list generated in the step above.
3. Or, if my understanding and approach above are not correct, advice on other possible options.

I have gone through material on the internet (I can only find examples of classifying text into two classes, not more than that), but I am not able to proceed further. I am new to text mining in R, so excuse me if this is very naive.

Any help or starting point would be great.

Prasanna Nandakumar
    Define KEYWORDS please. – Tyler Rinker Mar 10 '14 at 06:50
  • I need to find the `keywords` (the most frequently occurring words) for a category/subcategory and then, based on these keywords, when a new feedback email comes in, I should check for the keywords and categorize it – Prasanna Nandakumar Mar 10 '14 at 07:53
  • I have updated the question @Tyler Rinker – Prasanna Nandakumar Mar 11 '14 at 01:22
    When I run your dataset I get 2 errors. Can you please provide a reproducible one? Then I have a chance to help you ;) – Rentrop Mar 12 '14 at 20:41
  • I have updated the dataset and code – it's reproducible – Prasanna Nandakumar Mar 13 '14 at 02:20
  • Any help or starting point would be great – Prasanna Nandakumar Mar 13 '14 at 23:38
    This sounds like a job better suited to something like a naive Bayes classifier since you've got a training set and then wish to classify new elements from the training set. This https://code.google.com/p/rtexttools/source/browse/NaiveBayes.R?r=c8ec81e0f0c7dd089b8b44e9be360ea4617fe9d8 may help you get started down that path (and does build on the `tm()` work you've done so far). – hrbrmstr Mar 14 '14 at 10:35
  • @hrbrmstr: excellent hint. more generally, once you've generated the TermDocumentMatrix from your training data, you can use its transpose as the feature matrix for any algorithm (knn, svm, etc.) to learn a classification, and then create a TermDocumentMatrix with the same terms in the same order on your prediction data in order to compute predicted categories. Also check out this question and its replies: http://stackoverflow.com/questions/3584472/text-classification-categorization-algorithm. – fabians Mar 14 '14 at 11:06
  • Any thoughts on how to use a bag-of-words approach to classify categories – Prasanna Nandakumar Mar 14 '14 at 16:20
  • @Tyler Rinker I have updated the question, can you help me on this. Is there a better way to get `Keywords' for each category – Prasanna Nandakumar Mar 18 '14 at 02:11
  • Are you training a dataset for later use? How will you know what division it goes to based on category and sub-category? I actually don't think n-gram will be the best approach but I'm still unsure what exactly you're trying to do. – Tyler Rinker Mar 18 '14 at 02:20
  • Yes, I am training a dataset. I have a list of division–category–subcategory mappings from the domain people. Is there any other way I can train the system so that when a new email comes in, I route it to the correct division? – Prasanna Nandakumar Mar 18 '14 at 02:29
  • I only have the corresponding list of division, category and subcategory from the domain people. But I don't have keywords within a DIVISION-CATEGORY-SUBCATEGORY. – Prasanna Nandakumar Mar 18 '14 at 02:32

1 Answer


I'll give a brief answer here because your question is a little vague.

The code below will quickly create a TDM of 2-grams for each CATEGORY.

library(RWeka)
library(SnowballC)

# Create a function that produces an 'nvalue'-gram TDM for the underlying
# dataset. Note that the function accesses the step1 data.frame from the
# enclosing environment (it's not fed into the function). I'll leave it to
# someone else to fix this up!
makeNgramFeature=function(nvalue){

  tokenize=function(x){NGramTokenizer(x,Weka_control(min=nvalue,max=nvalue))}

  m <- list(ID = "FNUMBER", Content = "Description")
  myReader <- readTabular(mapping = m)
  txt <- Corpus(DataframeSource(step1), readerControl = list(reader = myReader))

  summary(txt)
  txt <- tm_map(txt, content_transformer(tolower))  # content_transformer() is needed for base functions in newer tm versions
  txt <- tm_map(txt,removeNumbers)
  txt <- tm_map(txt,removePunctuation)
  txt <- tm_map(txt,stripWhitespace)
  txt <- tm_map(txt,removeWords,stopwords("english"))
  txt <- tm_map(txt,stemDocument)


  tdm <- TermDocumentMatrix(txt,
                            control = list(removePunctuation = TRUE,
                                           stopwords = TRUE,
                                           tokenize=tokenize))
  return(tdm)
}

# 'all' is a list with one TDM per category. You could simply create a
# 'cascade' of by() calls, or create a unique list of category/sub-category
# pairs to analyse.
all=by(step1,INDICES=step1$CATEGORY,FUN=function(x){makeNgramFeature(2)})

The resulting list 'all' is a little ugly. You can run names(all) to look at the categories. I'm sure there is a cleaner way to solve this, but hopefully this gets you going on one of the many correct paths...
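Following the classifier suggestion from the comments, here is a minimal hedged sketch of the prediction step: transpose the TDM so documents are rows, train a naive Bayes model (e1071's naiveBayes is just one option; knn or svm would work the same way), and score new emails against the same vocabulary.

```r
library(e1071)

# Documents as rows, terms as columns; labels come from the training data
dtm <- as.data.frame(t(as.matrix(tdm)))
model <- naiveBayes(x = dtm, y = step1$CATEGORY)

# For a new email, build a TDM restricted to the training vocabulary
# (dictionary = Terms(tdm)) so the feature columns line up, then predict:
# new_tdm <- TermDocumentMatrix(new_corpus,
#                               control = list(dictionary = Terms(tdm)))
# predict(model, newdata = as.data.frame(t(as.matrix(new_tdm))))
```

With 100+ categories you will likely need far more than 20 documents per class for this to be reliable, but the same pattern scales to the full 23,017-record dataset.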

slimCity