I have more than 5,000 .htm files, each representing one firm, and I would like to combine them into a single dataset.
I have been trying to load the data in R as follows (the files are named *.xls because they were misnamed at the source, but they are really *.htm files):
library(rvest)
# list.files() takes a regular expression, not a glob, so match names ending in ".xls"
file.list <- list.files(pattern = "\\.xls$")
df.list <- lapply(file.list, read_html)
But this only gives me a list of parsed HTML documents, and I don't know how to turn them into a data frame with one observation per row.
If I run the following code:
data <- read_html("data.xls")
data <- html_table(data, fill = TRUE)[[1]]
data <- data[-1, ]  # remove the first row, which holds the column names
data <- as.data.frame(data)
I get what I want, but only for one file, so then I would need to repeat this for all the files manually.
Any help?
Thanks m
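One possible way to generalise the single-file code above is to wrap those steps in a function and apply it to every file. This is only a minimal sketch: it assumes the table of interest is always the first one in each file and that all files share the same columns. `read_firm`, `source_file`, and `data.all` are hypothetical names introduced for illustration.

```r
library(rvest)

# Hypothetical helper: read the first HTML table in one file and
# return it as a data frame, mirroring the single-file steps above.
read_firm <- function(path) {
  tab <- html_table(read_html(path), fill = TRUE)[[1]]
  tab <- as.data.frame(tab[-1, ])    # drop the first row, which holds the column names
  tab$source_file <- basename(path)  # record which file each row came from
  tab
}

# Files are named *.xls but contain HTML (pattern is a regular expression).
file.list <- list.files(pattern = "\\.xls$")
df.list <- lapply(file.list, read_firm)

# Stack the per-file data frames; assumes identical column names in every file.
data.all <- do.call(rbind, df.list)
```

If the columns are not identical across files, `rbind` will stop with an error; in that case `dplyr::bind_rows(df.list)` may be worth trying, since it fills columns missing from some files with `NA`.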