The end goal is to use the pdftools package to efficiently move through a thousand pages of PDF documents and consistently, and safely, produce a usable dataframe/tibble. I have attempted to use the tabulizer package and the pdf_text() function, but the results were inconsistent. Therefore, I started working through the pdf_data() function instead, which I prefer.

For those unfamiliar with the pdf_data() function, it converts a PDF page into a coordinate grid, with the 0,0 coordinate in the upper-left corner of the page. By arranging on the x,y coordinates and then pivoting the document into a wide format, all of the information is displayed as it would appear on the page, only with NAs for whitespace.
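For orientation, here is a minimal peek at the structure pdf_data() returns, using the same example file as in the code below. Each page comes back as a tibble with one row per word; the space column flags whether a space follows the word, i.e. whether the next word belongs to the same chunk of text.

library(pdftools)

# pdf_data() returns a list with one tibble per page; each row is one word,
# with columns width, height, x, y, space, and text
pdf_file <- "https://github.com/ropensci/tabulizer/raw/master/inst/examples/data.pdf"
head(pdf_data(pdf_file)[[1]])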

Here is a simple example using the familiar mtcars dataset.

library(pdftools)
library(tidyverse)
library(janitor)

pdf_file <- "https://github.com/ropensci/tabulizer/raw/master/inst/examples/data.pdf"

mtcars_pdf_df <- pdf_data(pdf_file)[[1]]

mtcars_pdf_df %>%
  arrange(x, y) %>%
  pivot_wider(id_cols = y, names_from = x, values_from = text) %>%   # columns are named after x coordinates
  unite(col = Car_type, `154`:`215`, sep = " ", remove = TRUE, na.rm = TRUE) %>%   # merge the car-name columns
  arrange(y) %>%
  rename("Page Number" = `303`) %>%
  unite(col = mpg, `253`:`254`, sep = "", remove = TRUE, na.rm = TRUE) %>%
  unite(col = cyl, `283`:`291`, sep = "", remove = TRUE, na.rm = TRUE) %>%
  unite(col = disp, `308`:`313`, sep = "", remove = TRUE, na.rm = TRUE)

It would be nice not to need a dozen or so unite() calls just to rebuild the various columns. I used the janitor package's row_to_names() function at one point to convert row 1 to column names, which worked well, but maybe someone has a better thought?
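For reference, a minimal sketch of the row_to_names() approach, where wide_df stands in for the pivoted tibble from the pipeline above (the name is just a placeholder):

library(janitor)

# Promote the first row (the header text from the PDF) to column names,
# drop it from the data, then tidy the resulting names
wide_df %>%
  row_to_names(row_number = 1) %>%
  clean_names()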

The central problem: removing the NAs from the dataset, either by uniting multiple columns or by shifting columns over so that NAs are filled from adjacent columns.
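One option for the shifting part is dplyr's coalesce(), which takes the first non-NA value position-wise across its arguments. A sketch, assuming col_a and col_b are two adjacent columns that are never filled on the same row (both names are placeholders):

library(dplyr)

# Collapse two mutually exclusive columns into one, then drop the originals
df %>%
  mutate(merged = coalesce(col_a, col_b)) %>%
  select(-col_a, -col_b)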

I'm trying to make this efficient, possibly using the purrr package? Any help with making this process more efficient would be very much appreciated.
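On the purrr idea: since pdf_data() returns one tibble per page, whatever per-page cleaning pipeline you settle on can be wrapped in a function and mapped over the whole document. A sketch, where parse_page() is a hypothetical function containing the arrange/pivot/unite logic above:

library(pdftools)
library(purrr)

# Run the same parser on every page and row-bind the results,
# keeping a page-number column via .id
all_pages <- map_dfr(pdf_data(pdf_file), parse_page, .id = "page")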

The only information I had on the pdf_data() function going into this is from https://ropensci.org/technotes/2018/12/14/pdftools-20/. Any additional resources would also be greatly appreciated (apart from the pdftools package help documentation/literature).

Thanks everyone! I hope this also helps others use pdf_data() too :)

James Crumpler

2 Answers

Here is one approach that could perhaps be generalised if you know the PDF is a reasonably neat table...

library(pdftools)
library(tidyverse)

pdf_file <- "https://github.com/ropensci/tabulizer/raw/master/inst/examples/data.pdf"

df <- pdf_data(pdf_file)[[1]]

df <- df %>% mutate(x = round(x/3),        #reduce resolution to minimise inconsistent coordinates
                    y = round(y/3)) %>% 
  arrange(y, x) %>%                        #sort in reading order
  mutate(group = cumsum(!lag(space, default = 0))) %>%  #identify text with spaces and paste
  group_by(group) %>% 
  summarise(x = first(x),
            y = first(y),
            text = paste(text, collapse = " ")) %>% 
  group_by(y) %>% 
  mutate(colno = row_number()) %>%         #add column numbers for table data 
  ungroup() %>% 
  select(text, colno, y) %>% 
  pivot_wider(names_from = colno, values_from = text) %>% #pivot into table format
  select(-y) %>% 
  set_names(c("car", .[1,-ncol(.)])) %>%   #shift names from first row
  slice(-1, -nrow(.)) %>%                  #remove names row and page number row
  mutate_at(-1, as.numeric)

df
# A tibble: 32 x 12
   car                 mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <chr>             <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Mazda RX4          21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2 Mazda RX4 Wag      21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3 Datsun 710         22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4 Hornet 4 Drive     21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5 Hornet Sportabout  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6 Valiant            18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7 Duster 360         14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8 Merc 240D          24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9 Merc 230           22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10 Merc 280           19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# ... with 22 more rows
Andrew Gustar
  • Nice. Of course, this can't be a general solution for all tables, and again falls into the "twiddling parameters" category (dividing by three before rounding). However, it seems fairly close to what the OP was looking for, so I'll award you the bounty purely for effort. Thanks. – Allan Cameron Feb 17 '20 at 11:12

I'll present a partial solution here, but please allow me to give you some background information first.

I am currently writing a pdf text / table extraction package from scratch in C++ with R bindings, which has required many months and many thousands of lines of code. I started writing it pretty much to do what you are looking to do: reliably extract tabular data from pdfs. I have got it to the point where it can quickly and reliably extract the text from a pdf document, with the associated positions and font of each text element (similar to pdftools).

I assumed that the technical part of reading the xrefs, handling encryption, writing a deflate decompressor, parsing the dictionaries, tokenizing and reading the page description programs would be the real challenges, and that figuring out a general algorithm to extract tabular data was just a detail I would figure out at the end.

Let me tell you, I'm stuck. I can assure you there is no simple, generalizable parsing function that you can write in a few lines of R to reliably extract tabular data from a pdf.

You have three options, as far as I can tell:

  1. Stick to documents where you know the exact layout
  2. Write a function with filter parameters that you can twiddle and check the results
  3. Use a very complex / AI solution to get very good (though never perfect) reliability

For the pdf example you provided, something like the following works fairly well. It falls into the "twiddling parameters" category, and works by cutting the text into columns and rows based on the density function of the x and y coordinates of the text elements.

It could be refined a great deal to generalize it, but that would add a lot of complexity and would have to be tested on lots of documents.

library(dplyr)   # needed for group_by() / summarise()

tabulize <- function(pdf_df, filter = 0.01)
{
  # Locate dense bands of x and y coordinates, then cut the page into
  # columns and rows at the gaps between those bands
  xd <- density(pdf_df$x, filter)
  yd <- density(pdf_df$y, filter)
  pdf_df$col <- as.numeric(cut(pdf_df$x, c(xd$x[xd$y > .5] - 2, max(xd$x) + 3)))
  pdf_df$row <- as.numeric(cut(pdf_df$y, c(yd$x[yd$y > .5] - 2, max(yd$x) + 3)))
  # Paste together all words that fall in the same cell
  pdf_df <- pdf_df %>% group_by(row, col) %>% summarise(label = paste(text, collapse = " "))
  res <- matrix(rep("", max(pdf_df$col) * max(pdf_df$row)), nrow = max(pdf_df$row))
  for(i in 1:nrow(pdf_df)) res[pdf_df$row[i], pdf_df$col[i]] <- pdf_df$label[i]
  # Drop empty rows and columns, then the header row
  res <- res[which(apply(res, 1, paste, collapse = "") != ""), ]
  res <- res[, which(apply(res, 2, paste, collapse = "") != "")]
  as.data.frame(res[-1, ])
}

which gives the following result:

tabulize(mtcars_pdf_df)
#>                     V1   V2  V3    V4  V5   V6    V7    V8 V9 V10 V11 V12
#> 1            Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0   1   4   4
#> 2        Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0   1   4   4
#> 3           Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1   1   4   1
#> 4       Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1   0   3   1
#> 5    Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0   0   3   2
#> 6              Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1   0   3   1
#> 7           Duster 360 14.3   8 360.0 245 3.21 3.570 15.84  0   0   3   4
#> 8            Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1   0   4   2
#> 9             Merc 230 22.8   4 140.8  95 3.92 3.150 22.90  1   0   4   2
#> 10            Merc 280 19.2   6 167.6 123 3.92 3.440 18.30  1   0   4   4
#> 11           Merc 280C 17.8   6 167.6 123 3.92 3.440 18.90  1   0   4   4
#> 12          Merc 450SE 16.4   8 275.8 180 3.07 4.070 17.40  0   0   3   3
#> 13          Merc 450SL 17.3   8 275.8 180 3.07 3.730 17.60  0   0   3   3
#> 14         Merc 450SLC 15.2   8 275.8 180 3.07 3.780 18.00  0   0   3   3
#> 15  Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0   0   3   4
#> 16 Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0   0   3   4
#> 17   Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0   0   3   4
#> 18            Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1   1   4   1
#> 19         Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1   1   4   2
#> 20      Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1   1   4   1
#> 21       Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1   0   3   1
#> 22    Dodge Challenger 15.5   8 318.0 150 2.76 3.520 16.87  0   0   3   2
#> 23         AMC Javelin 15.2   8 304.0 150 3.15 3.435 17.30  0   0   3   2
#> 24          Camaro Z28 13.3   8 350.0 245 3.73 3.840 15.41  0   0   3   4
#> 25    Pontiac Firebird 19.2   8 400.0 175 3.08 3.845 17.05  0   0   3   2
#> 26           Fiat X1-9 27.3   4  79.0  66 4.08 1.935 18.90  1   1   4   1
#> 27       Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0   1   5   2
#> 28        Lotus Europa 30.4   4  95.1 113 3.77 1.513 16.90  1   1   5   2
#> 29      Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0   1   5   4
#> 30        Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0   1   5   6
#> 31       Maserati Bora 15.0   8 301.0 335 3.54 3.570 14.60  0   1   5   8
#> 32          Volvo 142E 21.4 4 1 121.0 109 4.11 2.780 18.60  1   1   4   2
Allan Cameron
  • Nice work, and I really hope that pdf table extraction becomes more reliable due to the work you're putting into the package! At the moment, I think I'll continue exploring with the tidyverse `coalesce()` function, since it can be used to fill in gaps anywhere there is an NA. Using your function, then coalesce for spot-specific cleaning, may work best. – James Crumpler Feb 09 '20 at 14:12