R: Subscript out of bounds when using tm function Corpus on LexisNexis-data

Question

I'm trying to create a corpus of LexisNexis articles with the tm package. The articles were exported from LexisNexis as .html and are read into R with the tm.plugin.lexisnexis package, like so:

> library("tm")
> library("tm.plugin.lexisnexis")
> src <- LexisNexisSource("~/Desktop/lexisnexis.html")

Following the instructions in the tm.plugin.lexisnexis documentation, I then create a corpus with tm's Corpus function, which fails:

> data <- Corpus(src, readerControl = list(language = NA))
Error in getNodeSet(tree, "//div[@class = 'c3']/p[@class = 'c1']/span[@class = 'c4']")[[1]] : 
  subscript out of bounds

What does this error mean, and how do I fix it?

Example html-data: link

Hmm, I'm not sure I understand. Am I missing something in my .html file, or is the `src` object incomplete? - ageil, Jan 06 '16 at 15:12
Not sure what is going on there. For a general explanation of this error, see http://stackoverflow.com/questions/15031338/subscript-out-of-bounds-general-definition-and-solution - Agaz Wani, Jan 06 '16 at 15:14

score 1 · Answer 1 · answered Jan 09 '16 at 21:47

I'm the author of the package. It's currently broken as the format used by LexisNexis is undocumented. I'll try to fix it, but if anybody proposes a patch, it will happen sooner. :-)
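
As for what the error message itself means, here is a minimal base-R sketch. It assumes that the XPath query shown in the error matched no nodes in the exported file, so that getNodeSet() returned an empty result and `[[1]]` had nothing to index. The names `matches` and `first_or_na` are hypothetical and are not part of tm or tm.plugin.lexisnexis.

```r
# Base R only. `matches` stands in for an empty getNodeSet() result
# (an assumption about what happens inside the package's reader).
matches <- list()

# Asking for the first element of an empty list raises the error
# reported in the question.
msg <- tryCatch(
  matches[[1]],
  error = function(e) conditionMessage(e)
)
print(msg)

# Hypothetical guard: check the length before indexing and fall back
# to NA when nothing matched.
first_or_na <- function(x) {
  if (length(x) == 0) NA_character_ else x[[1]]
}

first_or_na(matches)           # nothing matched, so the fallback is used
first_or_na(list("headline"))  # a match exists, so it is returned
```

A guard like this only changes how the failure surfaces. It would not make the reader understand a changed export format; that still requires updating the XPath expressions in the package.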

Milan Bouchet-Valat
