Almost absolute beginner here.
I have a function to scrape a table from a PDF (I took the function from here and slightly adapted it).
The function is as follows.
# Requires stringr for str_split_fixed()
scrape_pdf <- function(tables, table_number, number_columns, column_names) {
  data <- tables[table_number]          # text of the requested page
  data <- trimws(data)
  data <- strsplit(data, "\n")[[1]]     # one element per line of the page
  data <- data[grep("XXX", data):grep("XXX", data)]  # keep lines between the markers
  data <- data[2:31]                    # drop the header row, keep the 30 data rows
  data <- str_split_fixed(data, " {2,}", number_columns)  # split on runs of 2+ spaces
  data <- data.frame(data, stringsAsFactors = FALSE)
  names(data) <- column_names
  return(data)
}
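For a single page this works fine. For example (the file name is just a placeholder, and I get the page-per-element text with pdftools):

```r
library(pdftools)
library(stringr)

# One character string per PDF page
mytable <- pdf_text("myfile.pdf")

# Scrape page 1 ("XXX" stands in for my real column names)
page1 <- scrape_pdf(tables = mytable,
                    table_number = 1,
                    number_columns = 3,
                    column_names = c("XXX", "XXX", "XXX"))
```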
My PDF has 198 pages. Each page has a table, with the format being identical on each page. I would like to scrape these 198 pages and then collate the results into a single dataframe.
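Conceptually, I think what I'm after is something like this sketch (untested; using lapply and do.call(rbind, ...), and assuming mytable holds the per-page text):

```r
# Build one data frame per page, then stack them into a single data frame
all_pages <- lapply(1:198, function(i) {
  scrape_pdf(tables = mytable,
             table_number = i,
             number_columns = 3,
             column_names = c("XXX", "XXX", "XXX"))
})
result <- do.call(rbind, all_pages)
```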
I tried looping over this function in the following way, which does not work.
x <- c(1:198)
while (x <= 198) {
  table[[x]] <- scrape_pdf(tables = mytable,
                           table_number = x,
                           number_columns = 3,
                           column_names = c("XXX",
                                            "XXX",
                                            "XXX"))
  x = x + 1
}
When I run this, I get the following error message.
Error in `[[<-`(`*tmp*`, i, value = value) :
recursive indexing failed at level 3
In addition: Warning message:
In while (x <= 198) { :
the condition has length > 1 and only the first element will be used
I am sure I am missing one or more steps. I would be grateful to anyone who has an idea of how to fix this, or how to do it more efficiently. (I understand the tabulizer package is quite handy, but I have had issues installing Java.)
Many thanks in advance!