I have rewritten the following to clarify this problem and the solution, and left the function and solution as an example at the bottom. Thanks again to John Coleman for the help!
The problem: I wrote a scraping function that worked when passed a single URL but, when passed a vector of URLs, threw this error:
Error in data.frame(address, recipename, prept, cookt, calories, protein, :
arguments imply differing number of rows: 1, 14, 0
It turned out that some of the URLs I was trying to scrape used a different tag for their instructions section. As a result, the xpathSApply call for instructions returned a list of length 0, and data.frame() raised the error above because it cannot combine columns of 1, 14 and 0 rows.
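For anyone who wants to see the failure in isolation, here is a minimal sketch that needs no scraping at all. The URL and the ingredient values are made up; the only point is that a 14-element column and a zero-length column cannot be combined into one data frame.

```r
# Made-up stand-ins for the scraped values (assumptions, not real data)
ingredients  <- paste("ingredient", 1:14)  # 14 matches, as in the error above
instructions <- list()                     # what a scrape with no match gave me

# Capture the error message instead of stopping the script
msg <- tryCatch(
  {
    data.frame(address = "http://example.com/recipe",
               list(ingredients), instructions)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

The length-1 address is recycled to 14 rows without complaint; it is the zero-length column that cannot be recycled.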
Figuring out the problem was simply a matter of running each url through until I found one that produced an error, and checking the html structure of that page.
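I did this by hand, but the same check can be sketched as a small loop. `find_failing` and `fake_scrape` below are hypothetical names introduced only for this illustration, and the URLs are made up; in practice you would pass the real scraping function instead of the stand-in.

```r
# Hypothetical helper: returns the inputs for which `scrape` raises an error.
find_failing <- function(urls, scrape) {
  failed <- vapply(urls, function(u) {
    tryCatch({ scrape(u); FALSE }, error = function(e) TRUE)
  }, logical(1))
  urls[failed]
}

# Stand-in scraper for demonstration only; it fails on one made-up URL.
fake_scrape <- function(u) {
  if (grepl("odd-page", u)) stop("arguments imply differing number of rows")
  data.frame(address = u)
}

urls <- c("http://example.com/a",
          "http://example.com/odd-page",
          "http://example.com/b")
bad <- find_failing(urls, fake_scrape)
print(bad)
```

With the real function this would be something like `find_failing(urls, f4fscrape)`, after which the html of each failing page can be inspected.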
Here is the function I originally wrote:
library(XML)  # provides htmlTreeParse, xmlRoot, xpathSApply and xmlValue

f4fscrape <- function(url) {
  # Create an empty data frame with the expected columns
  cols <- c('address', 'recipename', 'prept', 'cookt',
            'calories', 'protein', 'carbs', 'fat',
            'servings', 'ingredients', 'instructions')
  df <- data.frame(matrix(ncol = length(cols), nrow = 0))
  colnames(df) <- cols
  # Loop over every url: seq_along(url) visits each element,
  # whereas length(url) would only visit the last one
  for (i in seq_along(url)) {
    # Skip urls that are already in the data frame
    # (they are stored in the 'address' column)
    if (url[i] %in% df$address) { next }
    # Parse url as html
    doc2 <- htmlTreeParse(url[i], useInternalNodes = TRUE)
    # Define the root node
    top2 <- xmlRoot(doc2)
    # Scrape relevant data
    address <- url[i]
    recipename <- xpathSApply(top2[[2]], "//h1[@class='fn']", xmlValue)
    prept <- xpathSApply(top2[[2]], "//span[@class='prT']", xmlValue)
    cookt <- xpathSApply(top2[[2]], "//span[@class='ckT']", xmlValue)
    calories <- xpathSApply(top2[[2]], "//span[@class='clrs']", xmlValue)
    protein <- xpathSApply(top2[[2]], "//span[@class='prtn']", xmlValue)
    carbs <- xpathSApply(top2[[2]], "//span[@class='crbs']", xmlValue)
    fat <- xpathSApply(top2[[2]], "//span[@class='fat']", xmlValue)
    servings <- xpathSApply(top2[[2]], "//span[@class='yld']", xmlValue)
    ingredients <- xpathSApply(top2[[2]], "//span[@class='ingredient']", xmlValue)
    # This is the call that returns an empty list on some pages
    instructions <- xpathSApply(top2[[2]], "//ol[@class='methodOL']", xmlValue)
    # Create a data.frame of the url and relevant data
    # (one row per ingredient; the other values are recycled)
    result <- data.frame(address, recipename, prept, cookt,
                         calories, protein, carbs, fat,
                         servings, list(ingredients), instructions)
    # Rename the tricky column
    colnames(result)[10] <- 'ingredients'
    # Bind data to existing df
    df <- rbind(df, result)
  }
  # Return df
  df
}
And here's the solution: I added a conditional that falls back to the alternative tag whenever the first XPath returns nothing:
instructions <- xpathSApply(top2[[2]], "//ol[@class='methodOL']", xmlValue)
# Fall back to the alternative tag when the first XPath matches nothing
if (length(instructions) == 0) {
  instructions <- xpathSApply(top2[[2]], "//ul[@class='b-list m-circle instrs']", xmlValue)
}
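If more tag variants turn up, the same idea could be generalised. The sketch below is not what I actually used: `first_match` is a hypothetical helper that tries each XPath in turn and returns NA when none matches, so data.frame() never receives a zero-length column. To keep it runnable without a network connection, the demo uses a stand-in `query` function backed by a made-up lookup list rather than a real page.

```r
# Hypothetical helper: try each XPath in order and return the first
# non-empty result, or NA if none of them matches.
first_match <- function(xpaths, query) {
  for (xp in xpaths) {
    hit <- query(xp)
    if (length(hit) > 0) return(hit)
  }
  NA_character_
}

# Stand-in for a parsed page (assumption): only the second tag is present.
fake_page <- list("//ul[@class='b-list m-circle instrs']" = "Mix, then bake.")
query <- function(xp) {
  hit <- fake_page[[xp]]
  if (is.null(hit)) list() else hit
}

paths <- c("//ol[@class='methodOL']",
           "//ul[@class='b-list m-circle instrs']")
instructions <- first_match(paths, query)
print(instructions)
```

Inside the scraping function, `query` would instead be something like `function(xp) xpathSApply(top2[[2]], xp, xmlValue)`.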