Scraping PDF tables with empty Cells

Question

I'm using R to pull data from PDFs and so far it has been going well. I just opened up a new batch of PDFs and saw that I have to figure out how to account for empty cells. I haven't found a way to do this, and I have hundreds of pages that I need to go through.

I've included some sample data. I haven't found a way to attach the PDFs here, and these are not posted on the web anywhere. I saved df as a CSV, then copied and pasted that into a word document which I saved as a CSV for this example. Screenshot attached as well.

library(pdftools)
library(tidyverse)

# Example data
df <- data.frame("rows" = c("row1", "row2", "row3", "row4", "row5", "row6", "row7", "row8", "row9", "row10"),
                 "col1" = c(1, 2, "", 4, 5, 6, 7, 8, 9, 10),
                 "col2" = c(1, 2, 3, 4, "", "", 7, 8, 9, ""),
                 "col3" = c(1, 2, "", 4, 5, 6, 7, 8, 9, 10),
                 "col4" = c(1, 2, 3, 4, 5, 6, 7, "", 9, 10),
                 "col5" = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                 "col6" = c(1, 2, "", "", 5, 6, 7, "", 9, 10),
                 "col7" = c(1, 2, 3, 4, 5, "", 7, 8, 9, 10),
                 "col8" = c(1, "", 3, 4, 5, 6, 7, "", 9, 10),
                 "col9" = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
                 )

# Save example data, then save as a PDF outside of R.
# write_csv(df, "sample_data.csv")


# read in the PDF
pdf_file <- pdf_text("sample_data.pdf")

data <- pdf_file[1]
data <- trimws(data)
data <- strsplit(data, "\r\n")
data <- data[[1]]
data <- str_split_fixed(data, " {2,}", 10)  ## I think this is the step that needs to change
data <- data.frame(data, stringsAsFactors = FALSE)



# Print out outs of the data for reference. 
> data
      X1   X2   X3   X4   X5   X6   X7   X8   X9  X10
1   rows col1 col2 col3 col4 col5 col6 col7 col8 col9
2   row1    1    1    1    1    1    1    1    1    1
3   row2    2    2    2    2    2    2    2    2     
4   row3    3    3    3    3    3    3               
5   row4    4    4    4    4    4    4    4    4     
6   row5    5    5    5    5    5    5    5    5     
7   row6    6    6    6    6    6    6    6          
8   row7    7    7    7    7    7    7    7    7    7
9   row8    8    8    8    8    8    8               
10  row9    9    9    9    9    9    9    9    9    9
11 row10   10   10   10   10   10   10   10   10   


 df
    rows col1 col2 col3 col4 col5 col6 col7 col8 col9
1   row1    1    1    1    1    1    1    1    1    1
2   row2    2    2    2    2    2    2    2         2
3   row3         3         3    3         3    3    3
4   row4    4    4    4    4    4         4    4    4
5   row5    5         5    5    5    5    5    5    5
6   row6    6         6    6    6    6         6    6
7   row7    7    7    7    7    7    7    7    7    7
8   row8    8    8    8         8         8         8
9   row9    9    9    9    9    9    9    9    9    9
10 row10   10        10   10   10   10   10   10   10

UPDATE: Adding dput(pdf_file)

> dput(pdf_file)
"rows  col1    col2   col3    col4    col5    col6    col7    col8    col9\r\nrow1        1      1       1       1       1       1       1       1       1\r\nrow2        2      2       2       2       2       2       2               2\r\nrow3               3               3       3               3       3       3\r\nrow4        4      4       4       4       4               4       4       4\r\nrow5        5              5       5       5       5       5       5       5\r\nrow6        6              6       6       6       6               6       6\r\nrow7        7      7       7       7       7       7       7       7       7\r\nrow8        8      8       8               8               8               8\r\nrow9        9      9       9       9       9       9       9       9       9\r\nrow10      10             10      10      10      10      10      10      10\r\n"

You can see that there is a difference between df and data at this point. I've tried playing around with a few things and I haven't been able to make anything work well enough to post here. I tried using some if/else logic to say that if there were 3 or more spaces, insert NA, but that just caused a bunch of errors so I abandoned that approach. My goal is to get the data as close to df as possible.

score 0 · Answer 1 · answered Apr 01 '21 at 02:09

0

This looks like a good scenario to use the tabulizer package. It works really well when there are nicely formatted tables like this in the PDF. See the vignette. The best function here for you would be tabulizer::extract_tables. It should also recognize the blank spaces as empty values assuming the PDFs are all well formatted like this.

answered Apr 01 '21 at 02:09

mikeytop

150
9

Unfortunately, not all of the data is as well formatted as this, and I've had all sorts of trouble with tabulizer. It might be an option for some of them though, so I'll give it a shot. – pkpto39 Apr 01 '21 at 15:54

Ronak Shah · Accepted Answer · 2021-04-02T08:01:47.407

0

Try using read.fwf as a fixed-width file.

data <- pdf_file[1]
data <- trimws(data)
data <- strsplit(data, "\r\n")
data <- data[[1]]
writeLines(data, 'temp.txt')
result <- read.fwf('temp.txt', c(11, 2, rep(8, 8)), skip = 1, strip.white = TRUE)
names(result) <- scan(text = readLines('temp.txt', n = 1), what = character())
result

#    rows col1 col2 col3 col4 col5 col6 col7 col8 col9
#1   row1    1    1    1    1    1    1    1    1    1
#2   row2    2    2    2    2    2    2    2   NA    2
#3   row3   NA    3   NA    3    3   NA    3    3    3
#4   row4    4    4    4    4    4   NA    4    4    4
#5   row5    5   NA    5    5    5    5    5    5    5
#6   row6    6   NA    6    6    6    6   NA    6    6
#7   row7    7    7    7    7    7    7    7    7    7
#8   row8    8    8    8   NA    8   NA    8   NA    8
#9   row9    9    9    9    9    9    9    9    9    9
#10 row10   10   NA   10   10   10   10   10   10   10

edited Apr 02 '21 at 08:01

answered Apr 01 '21 at 06:13

Ronak Shah

377,200
20
156
213

Thank you. This gives my data headers but does not help to maintain the position of those empty cells in the dataframe, which is what I'm really after here. – pkpto39 Apr 01 '21 at 15:55
Yeah..It's quite difficult to adjust the answer when I don't have the data. Can you include the output of `dput(pdf_file)`in your post @pkpto39 ? – Ronak Shah Apr 02 '21 at 01:42
Yes, I have added that above. – pkpto39 Apr 02 '21 at 02:14
@pkpto39 Can you try the updated answer using `read.fwf`, it worked for me. – Ronak Shah Apr 02 '21 at 08:02
Thank you. It took some playing around with the widths, but I eventually got this to work. Do you have any suggestions on best determining column widths? – pkpto39 Apr 05 '21 at 23:55

Scraping PDF tables with empty Cells

2 Answers2

Linked