tips for reestablishing columns of a scraped web pdf

Question

I'm having trouble scraping this web pdf table:

https://www.usbr.gov/mp/cvo/vungvari/milfln.pdf

into a data.frame. Here's what I have so far:

library(pdftools)
df <- "https://www.usbr.gov/mp/cvo/vungvari/milfln.pdf"
df  <- pdf_text(df)

From here, it looks like all the data were scraped but the column structure is lost. What are some recommended ways to reestablish the columns (or scrape altogether differently without pdftools)? Once the columns are established, I think I'd be able to format/tidy the data. Thank you.

Tabulizer package instead? https://cran.r-project.org/web/packages/tabulizer/tabulizer.pdf — Reeza, Jun 12 '19 at 21:51
also see https://stackoverflow.com/questions/44141160/recognize-pdf-table-using-r — Ben, Jun 12 '19 at 23:45

cory · Accepted Answer · 2019-06-13T13:50:47.830

Your code returns the pdf as text. The table is in there, you just have to get it out.

library(pdftools)
file <- "https://www.usbr.gov/mp/cvo/vungvari/milfln.pdf"
rawtext  <- pdf_text(file)

start <- "\r\n    1"
end <- "\r\n  TOTALS"
df <- read.table(text=substring(rawtext, regexpr(start, rawtext), regexpr(end, rawtext)))
> head(df)
  V1     V2     V3     V4      V5      V6     V7     V8    V9     V10    V11    V12    V13
1  1 41,633 18,956 86,010 100,117 121,958 24,176 40,473 3,739 437,062 +1,443   +728  6,904
2  2 41,867 20,547 86,241 100,249 122,481 24,282 40,635 3,715 440,017 +2,955 +1,490  6,874
3  3 42,212 22,371 86,148 100,249 122,882 23,753 40,841 3,739 442,195 +2,178 +1,098  8,201
4  4 43,002 25,223 86,460 100,396 123,395 23,734 41,059 3,722 446,991 +4,796 +2,418  9,363
5  5 44,225 29,253 86,923 100,601 123,804 23,602 41,356 3,680 453,444 +6,453 +3,253 11,137
6  6 45,978 34,036 87,529 101,234 123,954 23,548 41,642 3,643 461,564 +8,120 +4,094 12,288
     V14     V15
1  7,632 1,554.8
2  8,364 1,571.4
3  9,299 1,589.8
4 11,781 1,613.2
5 14,390 1,641.7
6 16,382 1,674.2

regexpr returns the index in the string where start and end are. substring uses those indices to limit the text to just the table. Then read.table formats the text as a data.frame. You'll have to add column headers yourself. You'll probably also want to remove commas and change the columns to numeric too.

Somehow two days later it is no longer working and I'm unable to debug - `Error in read.table(text = substring(rawtext, regexpr(start, rawtext), : no lines available in input` . I thought there was probably a change in the source structure/format, but at least it looks like there's still the same bounding n 1 and n TOTALS (even though they are slightly auto-reformatted in this comment) — dbo, Jun 15 '19 at 19:42
Thanks @cory. It works for me now too, on my work computer (and not from my home computer over the weekend), so it's making me think there may be another library dependency we didn't list, or that the source text/text format can change slightly, and now it's back to what it originally was. I'll try again tonight from home computer and report back. That said - one more question - can you elaborate a little on adding the `/r` prefix for `start` and `end`? I don't see that in the `rawtext` and can't explain why we need it. — dbo, Jun 17 '19 at 15:44
So, maybe it's a version thing of R/RStudio/library, but on one machine for `rawtext` I see `\r` preceding the `\n` in the `start` and `end` areas (as above in the answer) and on the other those `\r` are not visible. Curious if anyone knows why? If I simply remove the `\r` the code works on both machines. — dbo, Jun 18 '19 at 14:14

tips for reestablishing columns of a scraped web pdf

1 Answers1