
I'm trying to write a loop for web scraping.

For each name in a list, the loop finds some metrics on a webpage dedicated to that name and builds a data frame with all the names and their related metrics.

Here is the code:

map_df(1:40, function(i) { 
  link = read_html(paste(link2,names[i], sep = ""))
  htmlnodes = html_nodes(link, ".col_2")
  htmltext = html_text(htmlnodes) 
  datatable = data.table(htmltext)  
  data.table(name = names[i],                
             Var1 = datatable$htmltext[as.numeric(which(grepl("Var1", datatable$htmltext))+1)], 
             Var2 = datatable$htmltext[as.numeric(which(grepl("Var2", datatable$htmltext)) +1)], 
             Var3 = datatable$htmltext[as.numeric(which(grepl("Var3", datatable$htmltext)) +1)],
             Var4 = datatable$htmltext[as.numeric(which(grepl("Var4", datatable$htmltext)) +1)],
             Var5 = datatable$htmltext[as.numeric(which(grepl("Var5", datatable$htmltext)) +1)],
             Var6 = datatable$htmltext[as.numeric(which(grepl("Var6", datatable$htmltext)) +1)], 
             Var7 = datatable$htmltext[as.numeric(which(grepl("Whitelist/Var7", datatable$htmltext)) +1)], 
           stringsAsFactors = FALSE) 
}) -> Mydata

(I use `which`/`grepl` because all the retrieved data is in a single column, and the value of each metric is one row below the name of the metric.)
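The lookup described above can be sketched on a small made-up vector. The labels and values below are invented for illustration and are not taken from the real page.

```r
# Hypothetical scraped text: each label is immediately followed by its value.
htmltext <- c("Var1", "10", "Var2", "20", "Var3", "30")

# Find the position of the label, then take the element one step after it.
pos <- which(grepl("Var2", htmltext))
value <- htmltext[pos + 1]
print(value)
```

This only returns a value when the label is present. If `grepl` finds no match, `pos` is empty and so is `value`.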

I checked with fewer metrics, and the loop works.

But with all the metrics I get the following error message:

Error in data.table(name = names[i], Var1 = datatable$htmltext[as.numeric(which(grepl("Var1",  : 
  Item 8 has no length. Provide at least one item (such as NA, NA_integer_ etc) to be repeated to match the 1 rows in the longest column. Or, all columns can be 0 length, for insert()ing rows into.

I guess it means I need an `ifelse` for the case where the loop does not find a metric such as "hardcap" or "country" on the webpage for the ith item, but I don't know how to do that.
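One possible way to handle a missing metric is a small helper that falls back to `NA`. This is a minimal sketch rather than a tested fix for the real pages: the helper name `value_after` and the sample text are hypothetical, and it assumes the labels can be matched as fixed strings.

```r
# Hypothetical helper: return the element after the first match of `label`,
# or NA_character_ when the label is absent or is the last element.
value_after <- function(x, label) {
  pos <- grep(label, x, fixed = TRUE)
  if (length(pos) == 0L || pos[1] >= length(x)) {
    return(NA_character_)
  }
  x[pos[1] + 1L]
}

# Made-up page text in which "Var3" is missing.
htmltext <- c("Var1", "10", "Var2", "20")

# Every column now has length one, so the row can be built.
row <- data.frame(
  Var1 = value_after(htmltext, "Var1"),
  Var3 = value_after(htmltext, "Var3"),
  stringsAsFactors = FALSE
)
print(row)
```

The same call could stand in for each `datatable$htmltext[...]` expression inside `data.table(...)`, so that a page without a given metric contributes `NA` instead of a zero-length column.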

Thanks for your help :)

  • `which` returns an integer vector, so I don't think you need the `as.numeric` calls – C. Braun Feb 15 '18 at 21:35
  • Thank you for the heads up :). But it does not solve my error. Any ideas? – JBR Feb 15 '18 at 23:24
  • There may be a better way to do it, but you could try something like: `Var6 = datatable$htmltext[ifelse(grep("Var6", datatable$htmltext), which(grepl("Var6", datatable$htmltext)) + 1, NA)]`. You will get more help on your question if you can provide the data you are working with. See [here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – C. Braun Feb 15 '18 at 23:39
  • Thanks for the tip. Unfortunately it does not work; I received the same error message. The data in "datatable" is a single column with Var1, value of Var1, Var2, value of Var2, etc. That's why I'm trying to search for Var1, Var2, etc. and get the values from the rows below. – JBR Feb 16 '18 at 00:24
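A likely reason the `ifelse` suggestion gives the same error is that `ifelse` returns a result as long as its test, and `grep` returns a zero-length vector when nothing matches. The sketch below shows this on invented text that has no "Var6" entry.

```r
# Made-up text without "Var6".
htmltext <- c("Var1", "10", "Var2", "20")

hits <- grep("Var6", htmltext)
print(length(hits))   # no match, so the test has length zero

# ifelse() is vectorised over its test, so the result is empty too.
idx <- ifelse(hits, hits + 1, NA)
print(length(idx))

# An explicit length check yields a length-one index instead.
idx2 <- if (length(hits) > 0) hits[1] + 1L else NA_integer_
print(htmltext[idx2])
```

Indexing with `NA_integer_` returns a single `NA`, which is long enough to fill one row.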
