
I'd like to ask a question about the issue I'm currently stuck with. When trying to scrape an HTML page (using RCurl), I encounter this error: "Error in curlMultiPerform(multiHandle): embedded nul in string". I read a lot about this type of error and advice on how to deal with it (including advice from Duncan Temple Lang, the creator of the RCurl package). But even after applying his advice (as follows), I still get the same error:

library(RCurl)  # getURLContent()
library(XML)    # htmlParse()
htmlPage <- rawToChar(getURLContent(url, followlocation = TRUE, binary = TRUE))
doc <- htmlParse(htmlPage, asText = TRUE)

Am I missing something? Any help will be much appreciated!


Edit:

However, there's a second error that I haven't mentioned in the original post. It occurs here:

data <- lapply(i <- 1:length(links),
               function(url) try(read.table(bzfile(links[i]),
                                            sep=",", row.names=NULL)))

The error: Error in bzfile(links[i]) : invalid 'description' argument.

'links' is a list of files' FULL URLs, constructed as follows:

links <- lapply(filenames, function(x) paste(url, x, sep="/"))

By using `links[i]`, I'm trying to refer to the current element of the `links` list in an ongoing iteration of `lapply()`.
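For reference, a minimal sketch of how `lapply()` supplies its argument (using the `links` object built above; this only illustrates the calling convention, not a complete fix):

# lapply() hands each *element* of 'links' to the function, so no loop index is needed:
lapply(links, function(u) u)
# An index-based equivalent iterates over positions and extracts elements with `[[`:
lapply(seq_along(links), function(i) links[[i]])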


Second Edit:

Currently I'm struggling with the following code. I found several more cases where people advise exactly the same approach, which makes me wonder why it doesn't work in my situation...

getData <- function(x) try(read.table(bzfile(x), sep = ",", row.names = NULL))
data <- lapply(seq_along(links), function(i) getData(links[[i]]))
Aleksandr Blekh
  • @mnel: 'url' is a HTTP URL of a web page for open source projects repository information, such as this one: http://flossdata.syr.edu/data/fc/2013/2013-Dec. – Aleksandr Blekh Feb 05 '14 at 03:18
  • What package does `htmlParse` come from? Also, could you include the script as it prints out, so for example, it is clear precisely where the error comes in (at first line or second line)? Thanks. – David Diez Feb 06 '14 at 05:37
  • Also, if it is in the second part, could try using `gsub` to get most of the way there, e.g. `gsub("<[^>]+>", "", htmlPage, perl = TRUE)`. I'm not sure if this is precisely what you are looking for, but maybe it gets in the right direction? (Note that this is not a super generalizable solution, e.g. I can come up with examples that break it, but it might work for the particular pages you are considering.) – David Diez Feb 06 '14 at 05:40
  • @DavidDiez: Thank you for your comments. Actually this allowed me to pinpoint the location of errors more correctly - it appears that the issue is not related to `htmlParse` (BTW, it's from `XML` package). I'm still printing debug info in several places to figure out more precisely where and what the problem is. Will report as soon as I know more. Appreciate your help! – Aleksandr Blekh Feb 06 '14 at 06:51
  • That error means you're trying to read a binary that contains nul bytes. R doesn't know what to do with them (or can't do anything with them, depending on how you want to think about it), so it fails. Are you trying to read the page you link to? Or one of the files linked on that page? If it's one of the files, note that they are bzip'd, so you're not going to be able to convert them to character without decompressing them first (and even then some are not text files). – Thomas Feb 06 '14 at 07:59
  • @Thomas, thank you for your comment. You're right that I'm trying to read binary files. Thanks to David Diez's comments, I realized that the errors occur in a different segment of code. I found that this segment was lacking proper treatment as binary and fixed it by using the following code: repoFiles = rawToChar(getURLContent(url, followlocation = TRUE, curl = curlHandle, binary = TRUE)) [cont'd in the next comment] – Aleksandr Blekh Feb 06 '14 at 09:28
  • However, there's 2nd error I haven't mentioned in the original post. It occurs here: `data <- lapply(i <- 1:length(links), function(url) try(read.table(bzfile(links[i]), sep=",", row.names=NULL)))` The error: `Error in bzfile(links[i]) : invalid 'description' argument`. 'links' is a list of files' FULL URLs, constructed as follows: `links <- lapply(filenames, function(x) paste(url, x, sep="/"))` By using `links[i]`, I'm trying to refer to the current element of `links` list in an ongoing iteration of `lapply()`. – Aleksandr Blekh Feb 06 '14 at 09:33
  • P.S. I know for a fact that bzip'd files are in CSV and, thus, text format. – Aleksandr Blekh Feb 06 '14 at 09:33
  • @AlexB. Why don't you move that comment into your original question (so it's easier to read) and then remove the RCurl tag (since the core issue is I/O-related, not web-related). – Thomas Feb 06 '14 at 11:43
  • Looking at your code, you're trying to use `for`-loop syntax in an `lapply` statement. You just want: `lapply(links, function(url) try(read.table(bzfile(url), sep=",", row.names=NULL)))`. – Thomas Feb 06 '14 at 11:46
  • @Thomas, per your recommendation, I replaced the original post with the comment describing the real issue (removed 'RCurl' tag as well). However, I don't understand your advice to use `url` as an actual parameter in `bzfile()`. `url` is a name of formal parameter, but the actual parameter is "the **current element** of `links` list in an _ongoing iteration_ of `lapply()`". I read that I could do that using `mapply()`/`lapply()` and `seq_along()`. But I think `i <- 1:length(links)` is equal to `seq_along()`. I'd like to know how to use `i` to refer to current element in context of `lapply()`. – Aleksandr Blekh Feb 06 '14 at 12:37
  • I think you need to read up on `lapply`, it doesn't work the way you seem to think it does. I think the other problem is that because you constructed `links` using `lapply`, you have a list rather than a vector, so when you need to extract from it later you should use `[[` extraction instead of `[` extraction. – Thomas Feb 06 '14 at 13:00
  • @Thomas, you're right about `[[` notation, I missed it. But, even after updating my code with `seq_along()` and `[[`, it still produces the same error. It's strange, because the following correct answer (#1, by Dason) seems to be very much similar to my situation: http://stackoverflow.com/questions/9048375/extract-names-of-objects-from-list. – Aleksandr Blekh Feb 06 '14 at 14:22
  • @Thomas, this is what I currently have: `getData <- function(x) try(read.table(bzfile(x), sep = ",", row.names = NULL))` AND `data <- lapply(seq_along(links), function(i) getData(links[[i]]))`. (Sorry, don't know how to insert new line character in comments.) – Aleksandr Blekh Feb 06 '14 at 14:34
  • Maybe @Dason would also be so kind to take a look at my two previous comments, including one where I refer to his solution. – Aleksandr Blekh Feb 06 '14 at 14:41

2 Answers


Sasha,

try this

library(XML)
url <- "http://flossdata.syr.edu/data/fc/2013/2013-Dec/"
doc <- htmlParse(url)
ndx <- getNodeSet(doc,"//table")

It works like a charm.
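If the goal is then to pull the data-file links out of that page, a possible next step (my sketch; xpathSApply() is also from the XML package, and the .bz2 extension is an assumption about the file names) is:

hrefs <- xpathSApply(doc, "//a/@href")          # all link targets on the page
bz2   <- grep("\\.bz2$", hrefs, value = TRUE)   # keep only the bzip2'd data files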

Good luck.

S.

Diegres

I was able to figure out the reasons for the issues described above myself. It took a lot of time and effort, but it was worth it: now I understand R lists and lapply() better.

Essentially, I made three major changes:

1) added textConnection() and readLines() to process CSV-like files:

conn <- gzcon(bzfile(file, open = "r"))    # open and decompress the bz2 archive
tConn <- textConnection(readLines(conn))   # expose its lines as a text connection
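The call that consumes tConn isn't shown above; presumably it looks roughly like this, with read.table() reading from the text connection (my assumption):

df <- read.table(tConn, sep = ",", row.names = NULL)   # parse the decompressed CSV lines
close(tConn); close(conn)                              # release both connections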

However, I've discovered some issues with this approach - see my other SO question: Extremely slow R code and hanging.

2) used the correct subsetting notation to refer to the appropriate element of the list inside the function(i) passed to lapply():

url <- links[[1]][i]

3) used the correct subsetting notation to refer to the whole list passed to lapply():

data <- lapply(seq_along(links[[1]]), getData)
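Putting the three changes together, the working version looked roughly like this (a sketch reconstructed from the points above; exactly how each remote file is fetched to something bzfile() can open follows the original code and is an assumption on my part):

getData <- function(i) {
  url   <- links[[1]][i]                               # current element of the links list (point 2)
  conn  <- gzcon(bzfile(url, open = "r"))              # open and decompress the bz2 archive
  tConn <- textConnection(readLines(conn))             # expose its lines as a text connection (point 1)
  try(read.table(tConn, sep = ",", row.names = NULL))  # parse the CSV-like contents
}

data <- lapply(seq_along(links[[1]]), getData)         # iterate over the whole vector (point 3)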

Thanks to everyone who participated in and helped answer this question!

Aleksandr Blekh