I have data from a PDF file that I am reading into R.

library(pdftools)
library(readr)
library(stringr)
library(dplyr)

results <- pdf_text("health_data.pdf") %>% 
  readr::read_lines()

When I read it in with this method, a character vector is returned. Multi-line information for a given column is spread across different lines, and not all columns for each observation will have data.

A reproducible example is below:

ex_result <- c("03/11/2012 BES 3RD          BES inc and corp           no-            no- sale -",
  "           group with                           sale        no- sale",  
  "           boxes",                                                                   
  "03/11/2012 KRS six and    firefly                  45       mg/dL  100 - 200",        
  "           seven",                                                                   
  "03/11/2012 KRS core    ladybuyg            55       mg/dL  42 - 87")

I am trying to use read_fwf with fwf_widths as I read that it can handle multi-line input if you give the widths for multi-line records.

ex_result_width <- read_fwf(ex_result, fwf_widths(
  c(10, 24, 16, 7, 5, 15,100), 
  c("date", "name","description", "value", "unit","range","ab_flag")))

I determined the sizes by typing in the console nchar with the longest string that I saw for that column.
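As a rough illustration of that measurement, the sketch below runs nchar over the longest value per column. It assumes the strings measured were the joined values shown in the desired output further down; `longest` and `widths` are names made up for this example.

```r
# Longest joined value per column, taken from the desired output in this post.
# Assumption: these are the strings that were measured. ab_flag is empty there,
# so the 100 used for it in fwf_widths() is not reproduced here.
longest <- c(
  date        = "03/11/2012",
  name        = "BES 3RD group with boxes",
  description = "BES inc and corp",
  value       = "no-sale",
  unit        = "mg/dL",
  range       = "no-sale no-sale"
)

# nchar() counts the characters in each string and keeps the names.
widths <- nchar(longest)
print(widths)
```

These are lengths of the joined values, not of the text on any single physical line (the first line only holds "BES 3RD" for name), which may be part of why the widths do not line up with the raw input.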

Using fwf_widths, I can get the date column by giving it a width of 10. But if I set the NAME column to, say, 24, read_fwf returns neighbouring columns concatenated together instead of splitting the rows to account for the multi-line values. The error then cascades: the remaining columns hold the wrong data, and whatever is left is dropped once the line runs out.

Ultimately this is the desired output:

desired_output <- tibble(
  date = c("03/11/2012","03/11/2012","03/11/2012"),
  name = c("BES 3RD group with boxes", "KRS six and seven", "KRS core"),
  description = c("BES inc and corp", "firefly", "ladybug"),
  value = c("no-sale", "45", "55"),
  unit = c("","mg/dL","mg/dL"),
  range = c("no-sale no-sale", "100 - 200", "42 - 87"),
  ab_flag = c("", "", ""))

I am trying to see:

  1. How can I get fwf_widths to recognize multi-line text and missing columns?
  2. Is there a better way to read in the pdf file to account for multi-line values and missing columns? (I was following this tutorial but it seems to have a more structured pdf file)
daneshjai
  • I don't think there's any way to get `read_fwf` to read records spread across multiple lines. You'll have to manipulate the input data to combine the values onto a single row. It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Apr 26 '20 at 22:21
  • I've spent a couple of hours on a similar task and ended up using 'pdftotext' (https://pypi.org/project/pdftotext/) then importing into R. It's not the best solution but it might be useful for you. – jared_mamrot Apr 26 '20 at 23:28
  • @jpmam1 I'm trying to avoid Python, if possible. But I'm curious: did the Python PDF tool just pick up the structure? – daneshjai Apr 27 '20 at 21:41
  • @MrFlick updated the question to include code to create the input, attempt, and output. – daneshjai Apr 27 '20 at 21:41
  • @daneshjai I used pdftotext to extract a large table from the supplementary section of a scientific article - the format was retained (the rows and columns were not lost) despite the table running over multiple pages. I'm not sure how you would get the same result using R, sorry. – jared_mamrot Apr 28 '20 at 06:27

1 Answer

str_subset(ex_result, pattern = "/\\d{2}/")
[1] "03/11/2012 BES 3RD          BES inc and corp           no-            no- sale -"
[2] "03/11/2012 KRS six and    firefly                  45       mg/dL  100 - 200"
[3] "03/11/2012 KRS core    ladybuyg            55       mg/dL  42 - 87"
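The filter above keeps only the lines that contain a date, so the continuation lines are lost. A possible next step, sketched here in base R, is to use the same idea to mark where each record starts and keep the continuation lines grouped with it. This assumes every record begins with a date in the first ten characters, which holds for the sample but may not hold for the full PDF; `starts_record`, `record_id` and `records` are names made up for this sketch. It only groups the lines and does not yet split them into columns.

```r
# Sample input copied from the question.
ex_result <- c(
  "03/11/2012 BES 3RD          BES inc and corp           no-            no- sale -",
  "           group with                           sale        no- sale",
  "           boxes",
  "03/11/2012 KRS six and    firefly                  45       mg/dL  100 - 200",
  "           seven",
  "03/11/2012 KRS core    ladybuyg            55       mg/dL  42 - 87"
)

# Assumption: a line that begins with an nn/nn/nnnn date starts a new record.
starts_record <- grepl("^[0-9]{2}/[0-9]{2}/[0-9]{4}", ex_result)

# cumsum() turns the start flags into a running record id, so each
# continuation line inherits the id of the dated line above it.
record_id <- cumsum(starts_record)

# One character vector of physical lines per logical record.
records <- split(ex_result, record_id)

# Number of physical lines that make up each record.
print(lengths(records))
```

Each element of `records` can then be processed on its own, for example by cutting every line at the same character positions and pasting the pieces for each column together.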

itellin
  • As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Oct 06 '22 at 09:44