I'm trying to create a corpus of articles from LexisNexis with the tm-package. The articles have been exported from LexisNexis as .html and are parsed into R with the tm.plugin.lexisnexis-package like so:

> library("tm")
> library("tm.plugin.lexisnexis")
> src <- LexisNexisSource("~/Desktop/lexisnexis.html")

Following the instructions in the tm.plugin.lexisnexis-documentation, I then create a corpus using the tm-package, like so:

> data <- Corpus(src, readerControl = list(language = NA))
Error in getNodeSet(tree, "//div[@class = 'c3']/p[@class = 'c1']/span[@class = 'c4']")[[1]] : 
  subscript out of bounds

What does this error mean, and how do I fix it?

Example html-data: link

ageil
  • Hmm, I'm not sure I understand. Am I missing something in my .html-file or is the `src`-object incomplete? – ageil Jan 06 '16 at 15:12
  • I'm not sure what is going on there. For a general explanation of this error, see http://stackoverflow.com/questions/15031338/subscript-out-of-bounds-general-definition-and-solution – Agaz Wani Jan 06 '16 at 15:14

1 Answer

I'm the author of the package. It's currently broken as the format used by LexisNexis is undocumented. I'll try to fix it, but if anybody proposes a patch, it will happen sooner. :-)
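
As for what the error message itself means, here is a minimal sketch. It is an illustration only, not the package's actual code. `getNodeSet()` from the XML package returns an empty list when no node matches the XPath expression, and indexing an empty list with `[[1]]` raises "subscript out of bounds". The HTML snippet below is made up for the example; only the XPath expression is taken from the error message in the question.

```r
library(XML)

# Hypothetical HTML fragment: the span carries class 'c9', not the 'c4'
# that the XPath expression expects.
html <- paste0(
  "<html><body><div class='c3'><p class='c1'>",
  "<span class='c9'>Some headline</span>",
  "</p></div></body></html>"
)
tree <- htmlParse(html, asText = TRUE)

# XPath copied from the error message in the question.
xpath <- "//div[@class = 'c3']/p[@class = 'c1']/span[@class = 'c4']"
nodes <- getNodeSet(tree, xpath)

# No node matches, so the result is an empty list.
n_matches <- length(nodes)

# Taking the first element of an empty list fails; capture the message
# instead of stopping the script.
msg <- tryCatch(
  {
    nodes[[1]]
    "no error"
  },
  error = function(e) conditionMessage(e)
)

print(n_matches)
print(msg)
```

If `length(nodes)` is also 0 when you run the same XPath against your own export, the file does not contain the structure the parser is looking for, which would be consistent with the export format differing from what the package expects.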