2

I'm trying to use pdftools package to extract data table from a pdf. My source file is here: https://hypo.org/app/uploads/sites/2/2021/11/HYPOSTAT-2021_vdef.pdf. Say, I want to extract data from Table 20 on page 170 (Change in Nominal house price)

I use the following code:

install.packages("pdftools")
library(pdftools)

report <- pdftools::pdf_data("https://hypo.org/app/uploads/sites/2/2021/11/HYPOSTAT-2021_vdef.pdf")

tab20 <- as.data.frame(report[170])

To get the proper table I had to manually indicate that I want to extract 170th element of the list (as the table is on page 170). If next year, a new page with table is added to the report, I will have to modify the code to extract 171th element. Is there a way to do it in a more automated manner?

Basically, what I need to do is to find the element of the list that contains string "Change in Nominal house price". Any suggestion how to do it?

Chris
  • 251
  • 1
  • 7
  • What is the format do you require your data.frame to take ? There is the string in the data.frame you load but the table has a very different format from what is displayed on the pdf. – Gowachin Nov 26 '21 at 13:34
  • 1
    @Gowachin In the end I would like the DF to look like the table in the report. I can further work on cleaning it and reshaping to the proper format using dplyr etc. But I wonder how to get the DF in the first place without specifying manually "170" as a parameter – Chris Nov 26 '21 at 13:41

2 Answers2

1

You can find a string with a corresponding pattern. By using multiple filters you can gather this singular table.

table <- report[grepl('Change', report) & grepl('Nominal', report) &
                grepl('house', report)]

I guess a more subtil regex could work. Also this only work because no other table have the same title, but it could be better to check if it only return a value like below :

place <- grepl('Change', report) &
            grepl('Nominal', report) &
            grepl('house', report)
if(sum(place) != 1){
  stop("There is not only one pattern that match. Adjust pattern.")
} else {
  table <- report[place]
}

EDIT : To speed this up, you better use the @Paul Smith solution. I adapted it with grepl and lapply and it is faster !!! However you need to make sure the title does not change at all.

system.time(
place <- unlist(lapply(report, function(x) grepl("Change in Nominal house price",
                                        paste(x$text, collapse = " "))))
)
#        user      system       spent 
#        0.07        0.00        0.08 
system.time(
place <- grepl('Change', report) & grepl('Nominal', report) &
            grepl('house', report)
)
#        user      system       spent 
#        1.99        0.01        2.03 

Gowachin
  • 1,251
  • 2
  • 9
  • 17
  • I can not find a way to grep multiple word without calling 3 times the function (which can take a long time with large dataset). Any idea to speed this up is welcome ! – Gowachin Nov 26 '21 at 14:40
0

Another solution, based on purrr::map_lgl:

library(tidyverse)
library(pdftools)

report <- pdftools::pdf_data("https://hypo.org/app/uploads/sites/2/2021/11/HYPOSTAT-2021_vdef.pdf")

map_lgl(
  report,
  ~ str_detect(
    str_c(.x$text, collapse = " "),
    "Change in Nominal house price")) %>% report[.]

#> [[1]]
#> # A tibble: 606 × 6
#>    width height     x     y space text       
#>    <int>  <int> <int> <int> <lgl> <chr>      
#>  1    59     14    39    38 TRUE  STATISTICAL
#>  2    35     14   102    38 FALSE TABLES     
#>  3    25     26    33    81 TRUE  20.        
#>  4    60     26    65    81 TRUE  Change     
#>  5    15     26   129    81 TRUE  in         
#>  6    67     26   149    81 TRUE  Nominal    
#>  7    47     26   221    81 TRUE  house      
#>  8    41     26   272    81 FALSE price      
#>  9    30     14    65   103 TRUE  Annual     
#> 10     7     14    98   103 TRUE  %          
#> # … with 596 more rows
PaulS
  • 21,159
  • 2
  • 9
  • 26