
I have started working on a project that requires Natural Language Processing and building a Support Vector Machine (SVM) model in R (I was asked to do it in R, though I know Python is more developed on this). I found an article here (packages: NLP, openNLP, rJava, RWeka). However, the article focuses on how to extract keywords (e.g. places, names…).

But since I want to build an SVM model, I'd like to generate a Term Document Matrix with all the tokens. I couldn't get it to work, since the class of the annotation does not apply in tm package.

Example:

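library(NLP)      # annotate(), AnnotatedPlainTextDocument(), sents()
library(openNLP)  # Maxent_Word_Token_Annotator(), Maxent_Sent_Token_Annotator()

# Three free-text protocol comments used as test input
testset <- c("From month 2 the AST and total bilirubine were not measured.",
             "16:OTHER - COMMENT REQUIRED IN COMMENT COLUMN;07/02/2004/GENOTYPING;SF- genotyping consent not offered until T4.",
             "M6 is 13 days out of the visit window")

# Sentence annotation must run before word annotation
word_ann <- Maxent_Word_Token_Annotator()
sent_ann <- Maxent_Sent_Token_Annotator()
test_annotations <- annotate(testset, list(sent_ann, word_ann))

# Attach the annotations to the text and list the tokens per sentence
test_doc <- AnnotatedPlainTextDocument(testset, test_annotations)
sents(test_doc)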

[[1]]
 [1] "From"       "month"      "2"          "the"        "AST"        "and"        "total"     
 [8] "bilirubine" "were"       "not"        "measured"   "."         

[[2]]
 [1] "16:OTHER"                         "-"                               
 [3] "COMMENT"                          "REQUIRED"                        
 [5] "IN"                               "COMMENT"                         
 [7] "COLUMN;07/02/2004/GENOTYPING;SF-" "genotyping"                      
 [9] "consent"                          "not"                             
[11] "offered"                          "until"                           
[13] "T4"                               "."                               

[[3]]
[1] "M6"     "is"     "13"     "days"   "out"    "of"     "the"    "visit"  "window" 
sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tm_0.6-2       openxlsx_3.0.0 magrittr_1.5   RWeka_0.4-28   openNLP_0.2-6  NLP_0.1-9     
[7] rJava_0.9-8   

loaded via a namespace (and not attached):
[1] openNLPdata_1.5.3-2 parallel_3.3.0      tools_3.3.0         Rcpp_0.12.5         slam_0.1-34        
[6] grid_3.3.0          knitr_1.13          RWekajars_3.9.0-1  

And now I don't know how to generate a TDM from here... Could anyone please give me some advice on this?

  • Could you explain what you mean by "the class of the annotation does not apply in tm package"? If you give a reproducible example that might also help. BTW "Python is more developed on this" is a *very* debatable statement ;) – Hack-R Jun 13 '16 at 17:50
  • Hi, Hack. Sorry, I am a newbie here and didn't know how to ask good questions. I have a dataset in which one of the columns is the description of the protocol: – Chih-Ching Yeh Jun 13 '16 at 18:13
  • I understand, no problem. I think we should be able to help you use that column if you can provide it or if you can use built-in data to simulate it as an example. Here's a good post that everyone should read: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Hack-R Jun 13 '16 at 18:15
  • Hi, Hack. I already reposted my question. Thanks for the help! – Chih-Ching Yeh Jun 13 '16 at 18:28
  • Richard and guys, now that he added the example, I think what he is asking is clearer. He wants to use a function (`annotate`) from the `openNLP` package in conjunction with `tm`. Is that still off topic? – Hack-R Jun 13 '16 at 18:32
  • Thanks Hack! So I want to use the algorithm of NLP to tokenize the texts but I don't know how to generate a TDM from there. Btw, I'm a girl :) – Chih-Ching Yeh Jun 13 '16 at 18:34
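
For what it's worth, the step from per-document tokens to a term-document matrix can be sketched in base R without going through `tm` at all. This is only a minimal sketch under assumptions: `tokens_per_doc` is a hypothetical list holding one character vector of tokens per document (here copied from the first and third outputs shown in the question), and the matrix is a plain integer matrix rather than a `tm` `TermDocumentMatrix` object.

```r
# Hypothetical input: one character vector of tokens per document.
# In practice these would come from the openNLP annotations of each text.
tokens_per_doc <- list(
  doc1 = c("From", "month", "2", "the", "AST", "and", "total",
           "bilirubine", "were", "not", "measured", "."),
  doc3 = c("M6", "is", "13", "days", "out", "of", "the", "visit", "window")
)

# Vocabulary: every distinct token seen in any document
terms <- sort(unique(unlist(tokens_per_doc, use.names = FALSE)))

# Count how often each vocabulary term occurs in each document.
# Rows are terms, columns are documents.
tdm <- vapply(
  tokens_per_doc,
  function(tok) tabulate(match(tok, terms), nbins = length(terms)),
  integer(length(terms))
)
rownames(tdm) <- terms

# Counts for one term across all documents
tdm["the", ]
```

Transposing with `t(tdm)` gives one row per document, which is the layout most model-fitting functions expect. Whether punctuation tokens such as "." should be kept, or the tokens lowercased first, depends on the model and is left open here.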

0 Answers