I have data from a PDF file that I am reading into R.
library(pdftools)
library(readr)
library(stringr)
library(dplyr)
results <- pdf_text("health_data.pdf") %>%
  readr::read_lines()
When I read it in this way, a character vector is returned. Multi-line information for a given column is spread across several lines, and not every column has data for each observation.
A reproducible example is below:
ex_result <- c("03/11/2012 BES 3RD BES inc and corp no- no- sale -",
" group with sale no- sale",
" boxes",
"03/11/2012 KRS six and firefly 45 mg/dL 100 - 200",
" seven",
"03/11/2012 KRS core ladybug 55 mg/dL 42 - 87")
I am trying to use read_fwf with fwf_widths, as I read that it can handle multi-line input if you give the widths for multi-line records.
ex_result_width <- read_fwf(
  ex_result,
  fwf_widths(
    c(10, 24, 16, 7, 5, 15, 100),
    c("date", "name", "description", "value", "unit", "range", "ab_flag")
  )
)
I determined the widths by calling nchar in the console on the longest string that I saw for each column.
Using fwf_widths, I can get the date column by giving it a width of 10 in the widths = argument. For the name column, however, if I set the width to, say, 24, it returns columns concatenated together rather than splitting the rows to account for the multi-line values. This cascades to the other columns, which end up with the wrong data, and the rest is dropped once the widths run out.
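To make the row structure concrete, here is a minimal sketch that marks which lines start a record and which are continuations. It rests on an assumption that is not guaranteed by pdftools or readr: every new record begins with a date of the form NN/NN/NNNN, and any line that does not is a continuation of the record above it. The names is_start, record_id and records are made up for illustration.

```r
library(stringr)

ex_result <- c("03/11/2012 BES 3RD BES inc and corp no- no- sale -",
               " group with sale no- sale",
               " boxes",
               "03/11/2012 KRS six and firefly 45 mg/dL 100 - 200",
               " seven",
               "03/11/2012 KRS core ladybug 55 mg/dL 42 - 87")

# Assumption: a record starts on a line that begins with NN/NN/NNNN.
is_start <- str_detect(ex_result, "^\\d{2}/\\d{2}/\\d{4}")

# Each TRUE bumps the counter, so continuation lines inherit
# the id of the record that precedes them.
record_id <- cumsum(is_start)

# One list element per record, holding all of its physical lines.
records <- split(ex_result, record_id)
lengths(records)
```

With the example above, the six physical lines fall into three groups, which matches the three rows of the desired output.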
Ultimately this is the desired output:
desired_output <- tibble(
  date = c("03/11/2012", "03/11/2012", "03/11/2012"),
  name = c("BES 3RD group with boxes", "KRS six and seven", "KRS core"),
  description = c("BES inc and corp", "firefly", "ladybug"),
  value = c("no-sale", "45", "55"),
  unit = c("", "mg/dL", "mg/dL"),
  range = c("no-sale no-sale", "100 - 200", "42 - 87"),
  ab_flag = c("", "", ""))
I am trying to see:
- How can I get fwf_widths to recognize multi-line text and missing columns?
- Is there a better way to read in the PDF file that accounts for multi-line values and missing columns? (I was following this tutorial, but it seems to use a more structured PDF file.)
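One possible direction, sketched under stated assumptions rather than offered as a definitive answer: cut each column out of every line by character position, then paste the pieces of each record together. The padded vector below is hypothetical data built with sprintf so that the column positions are known (date in characters 1-10, name in 12-21, description in 23-32); real output from pdf_text would need its own positions. The helper collapse_column is an illustrative name, not part of any package, and records are again assumed to start with a date.

```r
library(stringr)

# Hypothetical lines that still carry their column padding.
padded <- c(
  sprintf("%-10s %-10s %-10s", "03/11/2012", "KRS six", "firefly"),
  sprintf("%-10s %-10s %-10s", "", "and seven", ""),
  sprintf("%-10s %-10s %-10s", "03/11/2012", "KRS core", "ladybug")
)

# Assumption: a record starts on a line that begins with NN/NN/NNNN.
record_id <- cumsum(str_detect(padded, "^\\d{2}/\\d{2}/\\d{4}"))

# Slice one column from every line, drop empty pieces, and join
# the remaining pieces of each record with a single space.
collapse_column <- function(start, end) {
  pieces <- str_trim(str_sub(padded, start, end))
  vapply(
    split(pieces, record_id),
    function(p) paste(p[p != ""], collapse = " "),
    character(1),
    USE.NAMES = FALSE
  )
}

result <- data.frame(
  date = collapse_column(1, 10),
  name = collapse_column(12, 21),
  description = collapse_column(23, 32),
  stringsAsFactors = FALSE
)
result
```

A column that is empty on every line of a record comes back as an empty string, which is how the missing values appear in the desired output. This only works if the whitespace between columns survives the read, so the positions would have to be checked against the raw pdf_text output.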