R Find element of the list to extract table from pdf

Question

I'm trying to use pdftools package to extract data table from a pdf. My source file is here: https://hypo.org/app/uploads/sites/2/2021/11/HYPOSTAT-2021_vdef.pdf. Say, I want to extract data from Table 20 on page 170 (Change in Nominal house price)

I use the following code:

install.packages("pdftools")
library(pdftools)

report <- pdftools::pdf_data("https://hypo.org/app/uploads/sites/2/2021/11/HYPOSTAT-2021_vdef.pdf")

tab20 <- as.data.frame(report[170])

To get the proper table I had to manually indicate that I want to extract 170th element of the list (as the table is on page 170). If next year, a new page with table is added to the report, I will have to modify the code to extract 171th element. Is there a way to do it in a more automated manner?

Basically, what I need to do is to find the element of the list that contains string "Change in Nominal house price". Any suggestion how to do it?

What is the format do you require your data.frame to take ? There is the string in the data.frame you load but the table has a very different format from what is displayed on the pdf. — Gowachin, Nov 26 '21 at 13:34
@Gowachin In the end I would like the DF to look like the table in the report. I can further work on cleaning it and reshaping to the proper format using dplyr etc. But I wonder how to get the DF in the first place without specifying manually "170" as a parameter — Chris, Nov 26 '21 at 13:41

Gowachin · Answer 1 · 2021-11-26T14:51:08.833

You can find a string with a corresponding pattern. By using multiple filters you can gather this singular table.

table <- report[grepl('Change', report) & grepl('Nominal', report) &
                grepl('house', report)]

I guess a more subtil regex could work. Also this only work because no other table have the same title, but it could be better to check if it only return a value like below :

place <- grepl('Change', report) &
            grepl('Nominal', report) &
            grepl('house', report)
if(sum(place) != 1){
  stop("There is not only one pattern that match. Adjust pattern.")
} else {
  table <- report[place]
}

EDIT : To speed this up, you better use the @Paul Smith solution. I adapted it with grepl and lapply and it is faster !!! However you need to make sure the title does not change at all.

system.time(
place <- unlist(lapply(report, function(x) grepl("Change in Nominal house price",
                                        paste(x$text, collapse = " "))))
)
#        user      system       spent 
#        0.07        0.00        0.08 
system.time(
place <- grepl('Change', report) & grepl('Nominal', report) &
            grepl('house', report)
)
#        user      system       spent 
#        1.99        0.01        2.03

I can not find a way to grep multiple word without calling 3 times the function (which can take a long time with large dataset). Any idea to speed this up is welcome ! — Gowachin, Nov 26 '21 at 14:40

PaulS · Accepted Answer · 2021-11-26T14:30:57.387

Another solution, based on purrr::map_lgl:

library(tidyverse)
library(pdftools)

report <- pdftools::pdf_data("https://hypo.org/app/uploads/sites/2/2021/11/HYPOSTAT-2021_vdef.pdf")

map_lgl(
  report,
  ~ str_detect(
    str_c(.x$text, collapse = " "),
    "Change in Nominal house price")) %>% report[.]

#> [[1]]
#> # A tibble: 606 × 6
#>    width height     x     y space text       
#>    <int>  <int> <int> <int> <lgl> <chr>      
#>  1    59     14    39    38 TRUE  STATISTICAL
#>  2    35     14   102    38 FALSE TABLES     
#>  3    25     26    33    81 TRUE  20.        
#>  4    60     26    65    81 TRUE  Change     
#>  5    15     26   129    81 TRUE  in         
#>  6    67     26   149    81 TRUE  Nominal    
#>  7    47     26   221    81 TRUE  house      
#>  8    41     26   272    81 FALSE price      
#>  9    30     14    65   103 TRUE  Annual     
#> 10     7     14    98   103 TRUE  %          
#> # … with 596 more rows

R Find element of the list to extract table from pdf

2 Answers2