Almost absolute beginner here.

I have a function to scrape a table from a PDF (I took the function from here and adapted it slightly).

The function is as follows.

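library(stringr)  # provides str_split_fixed()

scrape_pdf <- function(tables, table_number, number_columns, column_names) {
  data <- tables[table_number]    # element for the requested page
  data <- trimws(data)
  data <- strsplit(data, "\n")    # split the page text into lines
  data <- data[[1]]
  # keep the lines between the first and second marker ("XXX" are placeholders)
  data <- data[grep("XXX", data):grep("XXX", data)]
  data <- data[2:31]              # rows 2 to 31 of that range
  # split each row into columns on runs of two or more spaces
  data <- str_split_fixed(data, " {2,}", number_columns)
  data <- data.frame(data, stringsAsFactors = FALSE)
  names(data) <- column_names
  
  return(data)
}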
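
To see what each step of the function does, here is a small sketch on a made-up page string. The page text, the markers `Name` and `Footer`, and the column names are invented for illustration. Base R's `strsplit()` combined with `do.call(rbind, ...)` stands in for `str_split_fixed()` so the example runs without stringr; that substitute only works when every row splits into the same number of pieces.

```r
# Made-up page text: a header, a column-title line, two rows, a footer
page <- paste("Report header",
              "Name      Score   City",
              "Ann       10      Oslo",
              "Bob       7       Rome",
              "Footer",
              sep = "\n")

# Split the page into individual lines
lines <- strsplit(trimws(page), "\n")[[1]]

# Keep everything from the start marker to the end marker
first <- grep("^Name", lines)
last  <- grep("^Footer", lines)
block <- lines[first:last]

# Drop the marker lines themselves, leaving only the table rows
body <- block[2:(length(block) - 1)]

# Split each row on runs of two or more spaces and stack the pieces
cells <- do.call(rbind, strsplit(body, " {2,}"))

toy <- data.frame(cells, stringsAsFactors = FALSE)
names(toy) <- c("name", "score", "city")
print(toy)
```

All columns come out as character, so numeric columns would still need converting, for example with `as.numeric()`.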

My PDF has 198 pages. Each page has a table, with the format being identical on each page. I would like to scrape these 198 pages and then collate the results into a single dataframe.

I thought of looping over this function in the following way, which does not work.

x <- c(1:198)
while(x<=198) {
table[[x]] <- scrape_pdf(tables = mytable,
                      table_number = x,
                      number_columns = 3,
                      column_names = c("XXX",
                                        "XXX",
                                        "XXX"))
x = x+1
}

When I run this, I get the following error message.

Error in `[[<-`(`*tmp*`, i, value = value) : 
  recursive indexing failed at level 3
In addition: Warning message:
In while (x <= 198) { :
  the condition has length > 1 and only the first element will be used
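
The warning arises because `x` holds all 198 page numbers at once, so `x <= 198` yields 198 logical values rather than one, and `table[[x]]` is asked to index with a whole vector. A minimal sketch of the two usual repairs follows. It assumes a hypothetical stand-in `fake_scrape()` in place of `scrape_pdf()` so that it runs without a PDF; the names `results`, `results_while`, and `n_pages` are likewise illustrative.

```r
# Hypothetical stand-in for scrape_pdf(): one small data frame per page
fake_scrape <- function(table_number) {
  data.frame(page = table_number, value = table_number * 2)
}

n_pages <- 3  # the question has 198 pages; kept small here

# Create the container before the loop. Naming it `results` rather than
# `table` avoids clashing with R's built-in table() function.
results <- vector("list", n_pages)

# Option 1: a for loop, where i takes one page number at a time
for (i in seq_len(n_pages)) {
  results[[i]] <- fake_scrape(table_number = i)
}

# Option 2: a while loop driven by a single (scalar) counter
results_while <- vector("list", n_pages)
x <- 1
while (x <= n_pages) {
  results_while[[x]] <- fake_scrape(table_number = x)
  x <- x + 1
}
```

With the real function, the call inside either loop would presumably be `scrape_pdf(tables = mytable, table_number = i, ...)` with the remaining arguments as in the question.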

I am sure I am missing at least one step. I would be grateful to anyone who has an idea of how to fix this, or how to do it more efficiently. (I understand the tabulizer package is quite handy, but I have had issues installing Java.)

Many thanks in advance!

srocco
  • Your x is a vector .. – MrSmithGoesToWashington Jul 06 '21 at 13:42
  • Thank you! I tried the following: `x <- c(1:198) for(i in x) { table_[[i]] <- scrape_pdf(etc. etc.))`. I get the following error message: "object table_ not found" – srocco Jul 06 '21 at 14:16
  • @srocco `table_` is not found, because it is a misspelling of `table`. – Greg Jul 06 '21 at 14:19
  • @Greg thank you! If I run the same code as above, but with `table[[i]]` instead of `table_[[i]]`, I get the following error message: "error in table[[i]] (...) object of type 'closure' is not subsettable" – srocco Jul 06 '21 at 14:21
  • @srocco Have you already defined `table` somewhere else? That part is not visible in the code. Is it a `list`? – Greg Jul 06 '21 at 14:23
  • Also, a couple things: **(1)** the statement `c(1:198)` is redundant, since `1:198` is already equivalent to the vector `c(1, 2, 3, ..., 197, 198)`; you can either **(2)** iterate over a predefined set (vector) of indices (page numbers) using `for`, as with `for(i in 1:198){...}`; or **(3)** iterate indefinitely using `while` and a scalar index that is updated on each iteration, as with `x <- 1; while(x <= 198){...; x <- x + 1}`. – Greg Jul 06 '21 at 14:29
  • @Greg No actually! My idea (pardon my very poor programming notions) was to create a number of tables, one for each of the 198 pages of my PDF, using the `scrape_pdf` function shown above. This might very well be a naïve approach to the issue :-) – srocco Jul 06 '21 at 14:29
  • @srocco It's not the _worst_ idea I've ever seen (believe me), but I'm still trying to figure out what the variable `table` actually _is_. Unless it **already exists** as a `list` of objects, you can't store the results of `scrape_pdf()` within it at position `x`: `table[[x]] <- scrape_pdf(...)`. – Greg Jul 06 '21 at 14:35
  • @Greg No, and again very sorry for my own confusion, `table` does not exist already. I thought I could create 198 tables (in my naïvety, `table_1` to `table_198`, hence the loop) by simply looping over the `scrape_pdf()` function 198 times, each time specifying, within the `table_number` argument, which page of the PDF the function should focus on. – srocco Jul 06 '21 at 14:42
  • @srocco You **can** do that, but your syntax `table[[x]] <- ...` presumes that a `list` named `table` already exists, and that you are trying to store something as the `x`th element in that list. Do you really want 198 separate variables, of the form `table_*`, floating around in your environment? IMO, it's far better to store them all as entries in a `list` of 198 elements, which can all be renamed according to the `table_*` convention. **Furthermore**, if you want to consolidate everything into a single table, you can just use `rbind` within your loop, to consecutively append each table. – Greg Jul 06 '21 at 14:44
  • @Greg Haha that's an excellent point. But just for me to understand - should I still want to have the 198 separate variables, what would my syntax look like? Thank you so much for all the help and sorry to be a pain!! – srocco Jul 06 '21 at 15:07
  • @srocco To dynamically name variables, check out [this comment](https://stackoverflow.com/questions/2679193/2679289#comment15326407_2679289) in response to [this answer](https://stackoverflow.com/a/2679289). You'd be using the `assign()` function to create variables whose names are generated by `paste()`, and whose values are given by `scrape_pdf()`. – Greg Jul 06 '21 at 15:10
  • @Greg Many thanks for all the help!!! – srocco Jul 06 '21 at 15:25
  • @srocco Happy to help! – Greg Jul 06 '21 at 15:26
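
The suggestions in the comments (store the pages in a `list`, name the entries `table_*`, combine them with `rbind`, and optionally create separate variables with `assign()`) could be sketched as follows. This is only an illustration: `fake_scrape()` is a hypothetical stand-in for `scrape_pdf()`, and `tables`, `all_tables`, and `n_pages` are made-up names. It calls `rbind` once on the whole list via `do.call()` instead of appending inside the loop, which is a swapped-in variant of the comment's idea.

```r
# Hypothetical stand-in for scrape_pdf(): one small data frame per page
fake_scrape <- function(table_number) {
  data.frame(page = table_number,
             value = letters[table_number],
             stringsAsFactors = FALSE)
}

n_pages <- 3  # 198 in the question; kept small here

# One list entry per page; lapply() creates the list itself
tables <- lapply(seq_len(n_pages), fake_scrape)

# Name the entries table_1, table_2, ...
names(tables) <- paste0("table_", seq_len(n_pages))

# Stack all pages into a single data frame
all_tables <- do.call(rbind, tables)
rownames(all_tables) <- NULL  # drop the row names derived from the list names

# Optional: also create separate variables table_1, table_2, ...
for (nm in names(tables)) {
  assign(nm, tables[[nm]])
}
```

With the real function, the `lapply()` call would presumably wrap `scrape_pdf()` in an anonymous function, along the lines of `lapply(1:198, function(i) scrape_pdf(tables = mytable, table_number = i, ...))`. Keeping the list is usually easier to work with than 198 loose variables, since `tables[["table_7"]]` or `tables[[7]]` retrieves any single page.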

0 Answers