I've found a workaround for a question I posted earlier, based on @Ryan's recommendation. This is the code:

library(rvest) # provides read_html(), html_nodes(), html_text() and re-exports %>%

for (i in seq_along(url)){

  webpage <- read_html(url[i]) #loop through URL list to access html data

  fac_data <- html_nodes(webpage,'.tableunder')  %>% html_text()
  fac_data1 <- html_nodes(webpage,'.tableunder1')  %>% html_text()
  fac_data <- c(fac_data, fac_data1) #Store table data on each URL in a variable 

  x <- fac_data %>% matrix(ncol = length(headers[[i]]), byrow=TRUE) #make matrix to extract column data

  for (j in seq_along(headers[[i]])){
    y <- cbind(x[,j]) #extract column data and store in temporary variable
    colnames(y) <- as.character(headers[[i]][j]) #add column name
    print(cbind(y)) #print each column with its header. ** Storing it with 'z <- cbind(y)' instead overwrites z on every iteration.
  }
}

I can now print all the values, complete with their column headers.


Some follow-up questions:

  1. How do I save the output of cbind(y) cumulatively in a data.frame or a list? Assigning cbind(y) inside the loop overwrites the previous value, which leaves me with only the last column of the last table, like this:

    退休年月

    [1,] "82年8月"

None of these variations work either:

z[[x]][j] <- cbind(y)

> source('~/Google 云端硬盘/R/scrapeFaculty.R')
Error in `*tmp*`[[x]] : attempt to select more than one element

z[j] <- cbind(y)

> source('~/Google 云端硬盘/R/scrapeFaculty.R')
There were 13 warnings (use warnings() to see them)

z[[j]] <- cbind(y)

> source('~/Google 云端硬盘/R/scrapeFaculty.R')
Error in z[[j]] <- cbind(y) : more elements supplied than there are to replace

  2. Can the double for-loop be replaced by a simple lapply() call to resolve the above issue?
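
For the first question, one possible approach is to accumulate the columns in a list that is created before the loop. The error messages above are consistent with z not being a list, but that is an assumption, since the question does not show how z was created. The sketch below uses a small made-up matrix and header vector (`x` and `hdr` are hypothetical stand-ins for the scraped data):

```r
# Hypothetical 3 x 2 character matrix standing in for the scraped `x`,
# plus its column headings (both invented for illustration).
x <- matrix(c("a", "1", "b", "2", "c", "3"), ncol = 2, byrow = TRUE)
hdr <- c("name", "year")

# Create a list up front so each slot can hold a whole one-column matrix.
z <- vector("list", length(hdr))

for (j in seq_along(hdr)) {
  y <- x[, j, drop = FALSE]  # keep the column as a one-column matrix
  colnames(y) <- hdr[j]      # attach the matching header
  z[[j]] <- y                # store by position instead of overwriting
}

# Bind the stored columns back together into one named matrix.
result <- do.call(cbind, z)
```

Using `[[j]]` on a list replaces exactly one slot, so a multi-row column fits into it without the length mismatch that `z[j] <- cbind(y)` runs into.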

EDIT:

Here's the final code I used to solve this:

ntu.hist <- list() #results list; it must exist before the loop assigns into it

for (i in seq_along(url)){

  webpage <- read_html(url[i])

  fac_data <- html_nodes(webpage,'.tableunder')  %>% html_text()
  fac_data1 <- html_nodes(webpage,'.tableunder1')  %>% html_text()
  fac_data <- c(fac_data, fac_data1)

  x <- fac_data %>% matrix(ncol = length(headers[[i]]), byrow=TRUE) #make matrix to extract column data
  y <- cbind(x[,1:length(headers[[i]])]) #extract column data
  colnames(y) <- as.character(headers[[i]]) #add column names
  ntu.hist[[i]] <- y #accumulate the results in a list

}
Sati
  • How and where is the `headers` dataset created? – Silence Dogood Sep 07 '17 at 10:28
  • It is created by scraping the column headings from the tables on the four webpages given by the url list. The html_table function cannot be used in this case because the resulting tables have an inconsistent number of columns. – Sati Sep 07 '17 at 10:39
  • Can you try this demo example `new_mtcars = do.call(rbind,lapply(1:nrow(mtcars),function(x) { tempDF = mtcars[x,drop=FALSE]; })` and modify it to suit your problem. Read help documentation from `?lapply`, `?do.call` or search for "lapply rbind" – Silence Dogood Sep 07 '17 at 12:26
  • @OdeToMyFiddle The demo doesn't work. > New_mtcars = do.call(rbind,lapply(1:nrow(mtcars),function(x) { tempDF = mtcars[x,drop=FALSE]; })) Error in `[.data.frame`(mtcars, x, drop = FALSE) : undefined columns selected In addition: There were 12 warnings (use warnings() to see them) Called from: `[.data.frame`(mtcars, x, drop = FALSE) – Sati Sep 07 '17 at 13:00
  • Sorry, can you try this instead: `new_mtcars = do.call(rbind,lapply(seq(1,nrow(mtcars),2),function(x) { tempDF = mtcars[x,]; }))`. This will `rbind` only the odd rows; compare the result with the original dataset `mtcars`. – Silence Dogood Sep 07 '17 at 15:00

2 Answers

Would it be an option to cbind multiple columns at once instead of looping? Would either of these syntax options help?

y <- data.frame(col1=c(1:3),col2=c(4:6),col3=c(7:9))

cbind(y[,c(1:3)])

  col1 col2 col3
1    1    4    7
2    2    5    8
3    3    6    9

#In R, you can use ":" to specify a range. So 1,2,3,4 is equal to 1:4.
#If you don't want number 3 in that range, you can use c(1,2,4).

#For example:

cbind(y[,c(1,3)])

  col1 col3
1    1    7
2    2    8
3    3    9
www

Here's the final code:

ntu.hist <- list() #results list; it must exist before the loop assigns into it

for (i in seq_along(url)){

  webpage <- read_html(url[i])

  fac_data <- html_nodes(webpage,'.tableunder')  %>% html_text()
  fac_data1 <- html_nodes(webpage,'.tableunder1')  %>% html_text()
  fac_data <- c(fac_data, fac_data1)

  x <- fac_data %>% matrix(ncol = length(headers[[i]]), byrow=TRUE) #make matrix to extract column data
  y <- cbind(x[,1:length(headers[[i]])]) #extract column data
  colnames(y) <- as.character(headers[[i]]) #add column names
  ntu.hist[[i]] <- y #accumulate the results in a list

}
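
On the second follow-up question, the same result can probably be obtained with a single lapply() call, which returns the list directly and removes the need to pre-create it. This is only a sketch under stated assumptions: `cells` is a hypothetical stand-in for the text scraped from each page (what the loop above builds in `fac_data`), and the `headers` values here are invented sample data, so the example runs without network access.

```r
# Hypothetical stand-ins for the scraped data: one flat character vector of
# cell text per page, and one vector of column headings per page.
cells <- list(
  c("A", "1990", "B", "1995"),
  c("C", "x", "2001", "D", "y", "2003")
)
headers <- list(
  c("name", "year"),
  c("name", "dept", "year")
)

# One named matrix per page; lapply() collects them into a list.
tables <- lapply(seq_along(cells), function(i) {
  m <- matrix(cells[[i]], ncol = length(headers[[i]]), byrow = TRUE)
  colnames(m) <- as.character(headers[[i]])
  m
})

tables[[2]]  # the second page's table, with three named columns
```

In the real script, the body of the function would start by reading `url[i]` and extracting the node text exactly as the for-loop does; each page keeps its own column count because every matrix is built separately.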
Sati