
I have rewritten the following to clarify the problem and its solution, and left the function and the fix as an example at the bottom. Thanks again to John Coleman for the help!

The problem: I created a data-scraping function which worked when passed a single url, but failed on a vector of urls with this error:

Error in data.frame(address, recipename, prept, cookt, calories, protein, : arguments imply differing number of rows: 1, 14, 0

It turned out that some of the urls I was trying to scrape used a different tag for their instructions section. As a result, the xpathSApply call for instructions returned a vector of length 0, which caused the differing-number-of-rows error when the results were assembled into a data.frame.

Figuring out the problem was simply a matter of running each url through individually until I found one that produced the error, then checking the html structure of that page.
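The failure is easy to reproduce in base R without any scraping: data.frame() recycles length-1 columns, but a zero-length column cannot be recycled against them. (The values below are made up purely for illustration.)

```r
# Minimal reproduction of the error: a zero-length column
# cannot be recycled against a length-1 column.
recipename   <- "Greek Chicken Tray Bake"   # length 1
instructions <- character(0)                # what the scrape returns when the tag is missing

res <- tryCatch(data.frame(recipename, instructions), error = function(e) e)
conditionMessage(res)
# "arguments imply differing number of rows: 1, 0"
```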

Here is the function I originally wrote:

f4fscrape <- function(url) {

#Create an empty dataframe

    df <- data.frame(matrix(ncol = 11, nrow = 0))
    colnames <- c('address', 'recipename', 'prept', 'cookt',
                  'calories', 'protein', 'carbs', 'fat',
                  'servings', 'ingredients', 'instructions')
    colnames(df) <- paste(colnames)

    #check for the recipe url in dataframe already,
    #only carry on if not present

    for (i in length(url)) 
            if (url[i] %in% df$url) { next }
    else {

    #parse url as html

    doc2 <-htmlTreeParse(url[i], useInternalNodes = TRUE)

    #define the root node

    top2 <- xmlRoot(doc2)

    #scrape relevant data

    address <- url[i]
    recipename <- xpathSApply(top2[[2]], "//h1[@class='fn']", xmlValue)
    prept <- xpathSApply(top2[[2]], "//span[@class='prT']", xmlValue)
    cookt <- xpathSApply(top2[[2]], "//span[@class='ckT']", xmlValue)
    calories <- xpathSApply(top2[[2]], "//span[@class='clrs']", xmlValue)
    protein <- xpathSApply(top2[[2]], "//span[@class='prtn']", xmlValue)
    carbs <- xpathSApply(top2[[2]], "//span[@class='crbs']", xmlValue)
    fat <- xpathSApply(top2[[2]], "//span[@class='fat']", xmlValue)
    servings <- xpathSApply(top2[[2]], "//span[@class='yld']", xmlValue)
    ingredients <- xpathSApply(top2[[2]], "//span[@class='ingredient']", xmlValue)
    instructions <- xpathSApply(top2[[2]], "//ol[@class='methodOL']", xmlValue)

    #create a data.frame of the url and relevant data.

    result <- data.frame(address, recipename, prept, cookt, 
                         calories, protein, carbs, fat, 
                         servings, list(ingredients), instructions)

    #rename the tricky column

    colnames(result)[10] <- 'ingredients'

    #bind data to existing df

    df <- rbind(df, result)
            }

    #return df

    df
}

And here's the solution - I simply added a conditional as follows:

instructions <- xpathSApply(top2[[2]], "//ol[@class='methodOL']", xmlValue)
if (length(instructions) == 0) {
    instructions <- xpathSApply(top2[[2]], "//ul[@class='b-list m-circle instrs']", xmlValue)
}
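A more general way to express this fallback is a small helper that tries each candidate XPath in turn and returns NA if none match. This is only a sketch: first_nonempty is a made-up name, and it assumes the XML package is loaded as in the original function.

```r
# Hypothetical helper (assumes library(XML) is loaded, as in the question):
# try each XPath in turn and return the first non-empty result,
# or NA if every candidate is missing from the page.
first_nonempty <- function(node, xpaths) {
    for (xp in xpaths) {
        val <- xpathSApply(node, xp, xmlValue)
        if (length(val) > 0) return(val)
    }
    NA_character_
}

# Usage inside the scraper would then look like:
# instructions <- first_nonempty(top2[[2]],
#     c("//ol[@class='methodOL']", "//ul[@class='b-list m-circle instrs']"))
```

This keeps the scrape code flat even if a third page layout turns up later: you just add another XPath to the vector.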
  • Note - the number of rows returned by the scrape differs for each url, as each recipe has a different set of ingredients. If there is a way to save all the ingredients to one cell, that would solve the problem for me, but I can't figure out how to do it, or whether the differing number of rows is actually the problem. – originalgriefster Jan 25 '17 at 23:51
  • sounds like you need to insert `NA` as place holders in some positions – John Coleman Jan 25 '17 at 23:57
  • @JohnColeman does that mean the issue is in fact that it's an unequal number of rows in the rbind? I'm not sure how to set it to insert NA's, doesn't appear that there is one in the rbind help page. Any ideas on how to approach this? Thanks for the comment! – originalgriefster Jan 26 '17 at 00:03
  • You may have as many rows in an rbind as you like and it will save correctly, provided you have all the columns in it. So if you have one of each column and 27 ingredients, then you have a problem. You can save lists into a table: you can target the cell with `df['rownumber', "ingredients"][[1]]` and so on, which lets you add multiple things to the list in that row/column location - you can add one at [[2]] or [[6]]... and if you save a list initially into the frame, it will not take up multiple rows. – sconfluentus Jan 26 '17 at 00:29
  • Looking at it more, I don't think that the problem is one of missing data so much as a mismatch between how you and R are thinking about the data. Could you post some sample urls? Otherwise, we can't reproduce your error. – John Coleman Jan 26 '17 at 00:33
  • BUT...you may be using the wrong data structure: if you are pulling in xml, leaving it in addressable xml or converting to super usable JSON is just as easy...and then you could store each recipe in a redis cache by name and call it out in a JSON string which can be cut, pasted, published, or coerced into frames so easily. Here is an online viewer: http://chris.photobooks.com/json/default.htm there are many others. – sconfluentus Jan 26 '17 at 00:33
  • @JohnColeman - I can't find the full list (have logged off the comp I was using), here's one of them: http://www.foodforfitness.co.uk/greek-chicken-tray-bake-recipe.php. Will post the list once I get back to the other comp tomorrow. – originalgriefster Jan 26 '17 at 00:42
  • That looks like a nice recipe. I'll have to try it sometimes :) – John Coleman Jan 26 '17 at 00:45
  • @bethanyP - I'm not sure JSON is what I'm after, I know very little about JSON. I'm trying to store the data in a data.frame so that I can then combine it with other recipes of similar format and create different functions with those dataframes for practise (for example, a function with ingredients and total time as inputs, which returns the recipes I could cook within those limitations). JSON at first glance seems to be a level above what I'm after? – originalgriefster Jan 26 '17 at 00:52
  • Have you considered using SQLite and building a little SQL database? There are some good tutorials and it offers the ability to save a recipe table, an ingredients table, a chef's table and then link them up for wicked easy sorting in R based on any dimension of the various tables. It gets you past the need for things to be rectangular in R data frames AND it allows your data to live in persistent memory on the hard drive, only entering precious memory when called by you. You can take a quick SQL class for free on Code Academy and then use the SQLite tutorials people post for the rest in R! – sconfluentus Jan 26 '17 at 02:20
  • Thanks, I might try building an SQL database so. I'm still a little confused though, since the data IS rectangular, so the rbind function should work, but it throws an error when I pass a vector of more than one element through it. It should still work, and I'm determined to figure out why it doesn't! – originalgriefster Jan 26 '17 at 13:12
  • I have now edited the question to include a sample of the urls I passed through to reproduce the error. @JohnColeman – originalgriefster Jan 26 '17 at 13:34

1 Answer


I was able to tweak your function so that it works:

f4fscrape <- function(urls) {

  #Create an empty dataframe

  df <- data.frame(matrix(ncol = 11, nrow = 0))
  cnames <- c('address', 'recipename', 'prept', 'cookt',
                'calories', 'protein', 'carbs', 'fat',
                'servings', 'ingredients', 'instructions')

  names(df) <- cnames

  #check for the recipe url in dataframe already,
  #only carry on if not present

  for (i in 1:length(urls)) {
    if (urls[i] %in% df$address) next
    #parse url as html

    doc2 <-htmlTreeParse(urls[i], useInternalNodes = TRUE)

    #define the root node

    top2 <- xmlRoot(doc2)

    #scrape relevant data

    address <- urls[i]
    recipename <- xpathSApply(top2[[2]], "//h1[@class='fn']", xmlValue)
    prept <- xpathSApply(top2[[2]], "//span[@class='prepTime']", xmlValue)
    cookt <- xpathSApply(top2[[2]], "//span[@class='cookTime']", xmlValue)
    calories <- xpathSApply(top2[[2]], "//span[@class='calories']", xmlValue)
    protein <- xpathSApply(top2[[2]], "//span[@class='protein']", xmlValue)
    carbs <- xpathSApply(top2[[2]], "//span[@class='carbohydrates']", xmlValue)
    fat <- xpathSApply(top2[[2]], "//span[@class='fat']", xmlValue)
    servings <- xpathSApply(top2[[2]], "//span[@class='yield']", xmlValue)
    ingredients <- xpathSApply(top2[[2]], "//span[@class='ingredient']", xmlValue)
    instructions <- xpathSApply(top2[[2]], "//ol[@class='methodOL']", xmlValue)

    #create a data.frame of the url and relevant data.

    result <- data.frame(address, recipename, prept, cookt, 
                         calories, protein, carbs, fat, 
                         servings, paste0(ingredients, collapse = ", "), instructions, stringsAsFactors = FALSE)


    df <- rbind(df, setNames(result, names(df)))
  }

  #return df

  df
}

Changes:

1) url is the name of a built-in function, so I renamed the argument urls; similarly, colnames is a base function, so the vector of column names became cnames.

2) I changed the way the column names were assigned.

3) The loop for (i in length(url)) visits only the last index, since length(url) is a single number rather than a sequence. I changed it to for (i in 1:length(urls)).
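The difference is easy to see with a toy vector:

```r
urls <- c("a", "b", "c")

# Original form: length(urls) is just the number 3, so the loop runs once.
only_last <- integer(0)
for (i in length(urls)) only_last <- c(only_last, i)     # i = 3 only

# Fixed form: 1:length(urls) is the sequence 1, 2, 3.
all_idx <- integer(0)
for (i in 1:length(urls)) all_idx <- c(all_idx, i)       # i = 1, 2, 3
```

(seq_along(urls) is an even safer choice, since 1:length(urls) becomes 1:0 when the vector is empty.)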

4) The condition if (url[i] %in% df$url) referred to a nonexistent column (url). I changed that to address.

5) The most important change: I concatenated the ingredients to a single string using paste0. With what you had, in the 1-url case, each ingredient was put on its own line, and the other columns (by the recycling rule) were just repeated. Run your current code with a single url and View() the result -- it probably isn't what you intended, so it isn't true that "It works when one url is passed to it".
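The recycling behaviour is worth seeing in isolation (toy values, not from the site):

```r
ingredients <- c("2 chicken breasts", "100g feta", "10 olives")

# Without collapsing, the length-1 columns are recycled to 3 rows:
bad  <- data.frame(recipename = "Tray Bake", ingredients)
nrow(bad)    # 3

# Collapsing to a single string first keeps one row per recipe:
good <- data.frame(recipename  = "Tray Bake",
                   ingredients = paste0(ingredients, collapse = ", "),
                   stringsAsFactors = FALSE)
nrow(good)   # 1
```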

6) With all these long strings, it seemed good to set stringsAsFactors = FALSE.

7) You need to set the names in the data frame when you rbind the new row. See this question.
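Without setNames, rbind refuses the new row because data.frame invents its own name for the pasted-together column. A toy example (the url is made up):

```r
df <- setNames(data.frame(matrix(ncol = 2, nrow = 0)),
               c("address", "ingredients"))

row <- data.frame("http://example.com/recipe",
                  paste0(c("feta", "olives"), collapse = ", "),
                  stringsAsFactors = FALSE)
names(row)   # auto-generated names that don't match df's columns

# rbind with mismatched names errors: "names do not match previous names"
err <- tryCatch(rbind(df, row), error = function(e) e)

# Renaming the row to df's columns makes the rbind succeed:
df <- rbind(df, setNames(row, names(df)))
```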

When you View the result of running the tweaked function on your given list you see the following (although not so zoomed-out of course):

(screenshot of the resulting data frame omitted)

I don't know enough about the XML library to help you with the speed. Sometimes it runs slowly, sometimes fast, so it might have to do mostly with connection speed and be largely beyond your control.

John Coleman
  • Excellent solution, and thank you very much for the notes. The for loop is a bad one for me as I actually know this, I just messed up. Concatenating ingredients to one string using paste0 is a perfect option, as I said in the question I'm happy to treat these as one cell then manipulate later if required. I will do some reading on the paste0 function. Not to worry on the speed, it will do me some good to investigate and attempt optimising myself. I suspect it's connection speed, but I know if I use the data.table package it's likely to speed up somewhat. – originalgriefster Jan 26 '17 at 18:32
  • A strange thing has happened here actually - when I run the sample of 5 I gave you, it works. When I run the full list of 30, it reproduces the error ' arguments imply differing number of rows: 1, 0 '. I'll double check my input urls and try to investigate this- perhaps one of them is faulty and thus producing a null set of values? If that's not the problem I'm at a loss as to why it would work for the first 5 but not the full list. EDIT: Actually it might be the html for whichever url doesn't work, I'll test by running them individually and look at the html. – originalgriefster Jan 26 '17 at 18:36
  • @ogriofac Maybe the problem is that `instructions` isn't as expected for some? See if you can find the first `url` that is causing the problem and then run the function with just that url and see what happens. I don't know how `xpathSApply` behaves if one of the classes you are looking for is missing from one of the web pages. If it returns `NA` then it shouldn't be an issue. If it doesn't -- perhaps an optional parameter can be set. As I said in my answer, I'm not really familiar with that package. – John Coleman Jan 26 '17 at 18:53
  • It's that exactly, I ran each url individually, turns out for one of the recipes the page's html is formatted differently, and the instructions aren't of class `'MethodOL'`, rather it's of class `'active'` with a series of `::before` commands to set the order. I will attempt to build a conditional to search for which term is present using xpathsapply, and store the data for the one that's present. Or maybe just use NA as you said. Looks like I picked a good site for building my first scrape function, lots of little issues to figure out! Thanks so much for your help! – originalgriefster Jan 26 '17 at 19:00