0

I have constructed a data frame by row binding different web-scraped tables.

# html files
filelist <- c("Prod223_2688_00185641_20190930.html","Prod224_0078_SO305092_20191130.html", 
"Prod224_0078_SO305426_20190831.html", "Prod224_0078_SO305431_20190831.html", 
"Prod224_0078_SO305440_20190831.html", "Prod224_0078_SO305451_20200331.html", 
"Prod224_0078_SO306088_20190531.html", "Prod224_0078_SO306098_20180630.html", 
"Prod224_0078_SO306098_20190630.html", "Prod224_0078_SO306411_20190530.html")

# web scraping tables
mydata <- lapply(filelist, function(x) {
  read_html(x) %>% rvest::html_table(fill = T) %>% 
    dplyr::nth(2) 
})

# row binding (adding a new column with row .id)
mydata <- rbindlist(mydata, idcol=T, fill = T) 

I want to create a new column company with the respective name from filelist based on row .id. The company name is the third code in between _. To get something like this:

mydata
 company  id.  X2 ..
00185641    1  .. 
00185641    1  .. 
SO305092    2  .. 
SO305426    3  .. 
SO305426    3  .. 

This may be quite a simple question but I am not confident with functions in R yet. I have seen this similar questions and tried:

mydata2 <- mydata2 %>% mutate(company=lapply(mydata2,filelist))
# and this:
mydata2$company <- rep(paste(filelist), length(mydata2$.id))
Amleto
  • 584
  • 1
  • 7
  • 25

1 Answers1

2

Don't have data to test this on but you can try the following :

library(dplyr)
library(rvest)

mydata <- sapply(filelist, function(x) {
  read_html(x) %>% rvest::html_table(fill = TRUE) %>% 
    dplyr::nth(2) 
}, simplify = FALSE)

mydata <- bind_rows(mydata, .id = ='company')
mydata$company <- sub('.*_(\\w+)_\\w+', '\\1', mydata$company)

We used sapply with simplify = FALSE to get a named list with filelist as names, when we use bind_rows that name is assigned as a new column company. Using regex we extract the relevant part of the data.

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213