Text mining from PDF form to dataframe using R

Question

I'm working on data mining from Electronic Health Records for my research. I've got a bunch of PDF files with lab results and patient data, which I can't seem to process it properly into a data frame.

I've used library(pdftools), as recommended by previous topics on the same subject, but I need a clue on how to proceed from here.

library(tidyverse)
library(pdftools)
library(tidyverse)
UC_text <- pdf_text("teste01.pdf")  %>% readr:: read_lines()

edit: as suggested below, I'm pasting dput(UC_text)instead

dput(UC_text)
c("GLICOSE                                            77 mg/dL                           (21/04/2018): 77 (06/01/2017): 79 (01/11/2014): 73", 
"Amostra: Soro", "Método: Enzimático Automatizado", "Valores de referência em jejum: 70 a 99 mg/dL", 
"(Conforme Diretrizes da Sociedade Brasileira de Diabetes - 2019- 2020)", 
"Data da Colheita/Recebimento: 05/02/2020         Horário da Colheita/Recebimento: 07:47:03", 
"Data da Liberação do Resultado: 05/02/2020       Horário da Liberação do Resultado: 09:49:05", 
"SUPRESSED INFO", 
"UREIA                                               27 mg/dL", 
"Amostra: Soro", "Método: Enzimático Automatizado", "Valores de Referência: 15 a 45 mg/dL", 
"(valor para a faixa etária deste paciente)", "Data da Colheita/Recebimento: 05/02/2020         Horário da Colheita/Recebimento: 07:47:03", 
"Data da Liberação do Resultado: 05/02/2020       Horário da Liberação do Resultado: 09:49:06", 
"SUPRESSED INFO", 
"CREATININA                                         0,87 mg/dL", 
"Amostra: Soro", "Método: Cinético Automatizado", "Valores de Referência: 0,3 a 1,3 mg/dL ( IFCC )", 
"(Valores para a faixa etária deste paciente)", "Data da Colheita/Recebimento: 05/02/2020         Horário da Colheita/Recebimento: 07:47:03", 
"Data da Liberação do Resultado: 05/02/2020       Horário da Liberação do Resultado: 09:49:05", 
"CONFERÊNCIA E LIBERAÇÃO ELETRÔNICA: AMANDA DA SILVA CURTO - CRBM 5 - 1966", 
"ÁCIDO ÚRICO                                         5,2 mg/dL", 
"Amostra: Soro", "Método: Enzimático Automatizado", "Valores de Referência: Homens: 3,7 a 9,2 mg/dL", 
"                            Mulheres: 3,1 a 7,8 mg/dL", "Data da Colheita/Recebimento: 05/02/2020         Horário da Colheita/Recebimento: 07:47:03", 
"Data da Liberação do Resultado: 05/02/2020       Horário da Liberação do Resultado: 09:49:04", 
"SUPRESSED INFO")

I'm interested in creating a data frame in which each exam type and its correspondent information is a column and each row correspond to each patient id and each cell is the value result of said exam with its correspondent values, e.g.

  id_patient dt_colct time_colct de_exame_type de_analito de_result cd_unidade
  <chr>       <chr>     <chr>     <chr>    <chr>      <chr>        <chr>     
1 1d71815e54… 15/05/20… 07:47:03      Glicose… Soro… 120           µL        
2 d9775c209f… 14/04/20… 07:47:03      UreiaT… Nitrito    Negativo     NULL      
3 d9775c209f… 07/04/20… 07:47:03      Creati… Neutrófil… 96.0         %         
4 d9775c209f… 09/04/20… 07:47:03      Acido.… Saturação… 97.8         %         
5 d9775c209f… 13/04/20… 07:47:03      Diferen… Monócitos… 1673         µL        
6 d9775c209f… 17/04/20… 07:47:03      Hemogra… Eosinófil… 0.1          %

I've tried first separating into different values by using

UC_text[c(1, 6, 7, 9, 10, 14, 17,18, 25, 26, 29)] %>% str_squish()

which gives

[1] "GLICOSE 77 mg/dL (21/04/2018): 77 (06/01/2017): 79 (01/11/2014): 73"                   
 [2] "Data da Colheita/Recebimento: 05/02/2020 Horário da Colheita/Recebimento: 07:47:03"    
 [3] "Data da Liberação do Resultado: 05/02/2020 Horário da Liberação do Resultado: 09:49:05"
 [4] "UREIA 27 mg/dL"                                                                        
 [5] "Amostra: Soro"                                                                         
 [6] "Data da Colheita/Recebimento: 05/02/2020 Horário da Colheita/Recebimento: 07:47:03"    
 [7] "CREATININA 0,87 mg/dL"                                                                 
 [8] "Amostra: Soro"                                                                         
 [9] "ÁCIDO ÚRICO 5,2 mg/dL"                                                                 
[10] "Amostra: Soro"                                                                         
[11] "Mulheres: 3,1 a 7,8 mg/dL"

but I'm lost from here. Can anyone shed a light in here?

if you need help, then we should have the `UC_text` same as you have it. So, if you run `dput(UC_text)` on your computer, copy the output and paste it here as your data. check [this](https://stackoverflow.com/questions/49994249/example-of-using-dput) out — Onyambu, Jul 31 '20 at 13:27
It seems your list vector is not complete. You did not copy all the way down since there is a missing clossing paranthesis — Onyambu, Jul 31 '20 at 14:28
Fixed. The closing parenthesis was missing as you've mentioned. — dairelix, Jul 31 '20 at 14:57

Paul van Oppen · Accepted Answer · 2020-07-31T03:45:11.140

There are many ways to approach this and this is intended only to show how a possible solution could look like. I had a quick go at it and worked out something that works under the assumptions:

Groups are separated by the text "Sample: Soro"
The groups have a regular structure (e.g. date of collection is always the 6st line in a group)
Date formats are regular
etc.

If your pdf contains irregularities this task may well become a lot harder. Note also that I am not great at Regex; I am trying to match just the simplest of cases. Others may well come up with better solutions. I converted your data into a tibble with one character column called lines.

Approach: setup a meaningful grouping and slice the relevant lines. Then use regex and class coercion (char to date, char to num).

library(dplyr)
library(tibble)
library(stringr)

data <- data %>%
  mutate(treatments = lead(cumsum(str_detect(lines, pattern = fixed("Sample: Soro", ignore_case = TRUE))), 1)) %>% # grouping variable
  mutate(treatments = ifelse(is.na(treatments), max(treatments, na.rm = TRUE), treatments)) %>% # the `lead` function leaves the last row as `NA`. Fix this.
  group_by(treatments) %>% #group the data
  slice(c(1,6)) %>% #slice the wanted rows by group
  ungroup() %>% # remove grouping
  mutate(lines = str_squish(lines)) %>% # get rid of extra spaces
  mutate(date =  as.Date(str_extract(lines, pattern = "[0-9]{2}\\/[0-9]{2}\\/[0-9]{4}"), "%d/%m/%Y")) %>% # extract dates
  mutate(exam = str_extract(lines, pattern = "^[[:alpha:]]+")) %>% # extract exam
  mutate(qty =  str_extract(lines, pattern = "[0-9]+[,.]?[0-9]*")) %>% # extract quantity
  mutate(unit = "mg/dL") %>% # set unit
  mutate(qty = as.numeric(str_replace(qty, ",", "."))) %>% # replace comma separator by dot
  group_by(treatments) %>% # group
  mutate(date = lead(date,1)) %>% # move the date up one row by group
  slice(1) %>% # slice 1st row by group
  ungroup() %>% # remove group
  select(-treatments, -lines)

With this result:

> data
# A tibble: 4 x 4
  date       exam         qty unit 
  <date>     <chr>      <dbl> <chr>
1 2020-02-05 GLICOSE    77    mg/dL
2 2020-02-05 UREIA      27    mg/dL
3 2020-02-05 CREATININA  0.87 mg/dL
4 2020-02-05 ÁCIDO       5.2  mg/dL

Hope this helps to get you on your way.

Hey Paul! Thank you for taking the time to answer. I did try the method you've mentioned, but as I tried to make it work, an error came up, as ``` Error in UseMethod("mutate_") : no applicable method for 'mutate_' applied to an object of class "character" ``` Should I try another method instead? — dairelix, Jul 31 '20 at 10:31
Please check if you have any issues with parentheses, a missing parenthesise in the dplyr chain can cause such errors. — Paul van Oppen, Aug 03 '20 at 02:06
Low brow error checking: select the first two lines of the `dplyr` chain (without the `%>%` at the end) and run the code. Check for error messages and the returned object. Then run including the next line. Check again etc. This tells you where the error occurs. — Paul van Oppen, Aug 03 '20 at 07:58

Text mining from PDF form to dataframe using R

1 Answers1