Loop through Word/PDF documents and extract specific text to table R

Question

I have a folder with around 150 Word and PDF (same text) documents. Data is here: http://www.sicgen.pt/antigen_folder/data_sheet/AB0003_ERP57_AB_data_sheet2003.pdf

The text always looks like this (after loading with pdftools):

library(pdftools)
u <- pdf_text("AB0003_ERP57_AB_data_sheet200.pdf")

[1] "                                                                     Product Data Sheet\r\n                                                                                      001 Rev1 Jan 2012 by JR\r\nCatalogue No. AB0003-200\r\nQty: 400 µg (2 mg/ml)\r\n                                  ERp57 Polyclonal Antibody\r\nSource: Goat                                               phospholipase C alpha, PI PLC, protein disulfide\r\n                                                           isomerase A3 antibody.\r\nGeneral description: Goat polyclonal to ERp57 -\r\nendoplasmic reticulum lumen marker. This                   Form: Polyclonal antibody supplied as a 200 µl\r\nendoplasmic reticulum protein interacts with lectin        (2 mg/ml) aliquot in PBS, 20% glycerol and 0.05%\r\nchaperones calreticulin and calnexin to modulate           sodium azide. This antibody is epitope-affinity\r\nfolding of newly synthesized glycoproteins. It has         purified from goat antiserum.\r\ndisulfide isomerase activity and complexes of\r\nlectins and this protein mediate protein folding by        Immunogen: Recombinant peptide derived from\r\npromoting formation of disulfide bonds in their            within residues 300 aa to the C-terminus of human\r\nglycoprotein substrates.                                   ERp57 produced in E. 
coli.\r\nAlternative names: 58 kDa glucose regulated                Specificity: Detects a band of 60 kDa by Western\r\nprotein, 58 kDa microsomal protein, disulfide              blot in the following canine, human, monkey,\r\nisomerase ER 60, endoplasmic reticulum resident            mouse, rat whole cell lysates.\r\nprotein 57, endoplasmic reticulum resident protein\r\n60, ER protein 57, ER protein 60, ER protein 61,\r\nERP57, ERp60, ERp61, glucose regulated protein\r\n58 Kd, GRP57, GRP58, HsT17083, P58, PDIA3,\r\nReactivity: Reacts against human, rat, mouse, canine and monkey proteins.\r\nSample                Western blot      Immuno-        Histochemistry (paraffin)     Histochemistry (frozen)\r\n                                        fluorescence\r\nhuman                 +++               +++            +++                           +++\r\nrat                   +++               +++            +++                           +++\r\nmouse                 +++               +++            +++                           +++\r\ncanine                +++               +++            +++                           +++\r\nmonkey                +++               +++            +++                           +++\r\n+++ excellent, ++ good, + poor, ND not determined\r\nUsage: Western blot                    1:500-1:2,000       Storage: Store at -20 C for long-term storage. Store\r\nImmunofluorescence                        1:50-1:500       at 2-8 C for up to one month.\r\nImmunohistochemistry (paraffin)        1:200-1:1,000\r\nImmunohistochemistry (frozen)          1:200-1:1,000       Special instructions: Avoid freeze/thaw cycles.\r\nSICGEN - Research and Development in Biotechnology Ltd\r\nEstrada do Pombalinho, Rabaçal, 3230-544 PENELA – PORTUGAL\r\nwww.sicgen.pt                                                                           information@sicgen.pt\r\n"
[2] "                                                                          Product Data Sheet\r\n                                                                                             001 Rev1 Jan 2012 by JR\r\nReferences:\r\n                                    For research use only, not for diagnostic use\r\nSICGEN's Proprietary Immunogen Policy\r\nIn order to produce high specific antibodies SICGEN has invested a lot of time and effort into selecting immunogen\r\nsequences. SICGEN has decided to protect this information by not publishing it on the website. However, these sequences\r\nare available on request.\r\nSICGEN - Research and Development in Biotechnology Ltd\r\nEstrada do Pombalinho, Rabaçal, 3230-544 PENELA – PORTUGAL\r\nwww.sicgen.pt                                                                                  information@sicgen.pt\r\n"

I wish to transform this into a data frame or table in either R or Excel.

 Catalogue.No.  Name Source.
1    AB0003-200 ERp57    Goat
2    AB0004-500 (...)   (...)
                                                                                                  General.Description
1 Goat polyclonal to ERp57 -  endoplasmic reticulum lumen marker.  This endoplasmic reticulum protein interacts (...)
2                                                                                                               (...)
                        Alternative.names.
1 58 kDa glucose  regulated protein, (...)
2                                    (...)
                                                               Form.
1 Polyclonal antibody supplied as a  200 µl (2 mg/ml) aliquot in PBS
2                                                              (...)
                                                       Immunogen
1 Recombinant peptide derived  from within residues 300 aa (...)
2                                                          (...)
                       Specificity.                     Reactivity.
1 Detects a band of  60 kDa by(...) Reacts against  human, rat, ...
2                             (...)                           (...)
                                         Usage.
1 Western blot 1:500-1:2,000 Immunofluorescence
2                                         (...)

If you have any suggestion please let me know.

What have you tried so far? Can you provide an example of the data as it looks when it's read into R? Does it look like "Catalogue No. AB0003-200 Qty: 400 µg (2 mg/ml)\n\nERp57 Polyclonal Antibody\n\n"? — AodhanOL, Jan 05 '18 at 11:44
I have tried pdftools and qdapRegex in R, but they do not work as I expect: I get NAs. The example I gave was just a "handmade" data frame. The data is a simple text document with images and flowing text as above. — Filipe Rigueiro, Jan 05 '18 at 12:51
could you include a sample of how the text actually looks when it's imported? If it's failing at import, that needs to be fixed first. If it's importing but in an unexpected format that's a different issue which can be addressed by seeing what it looks like. — AodhanOL, Jan 05 '18 at 13:03
Thanks for adding the example, I'll have a look at this later on today if no-one else gets back in the meantime. — AodhanOL, Jan 05 '18 at 13:51
Brilliant. I'm also trying some other things, but no luck so far. — Filipe Rigueiro, Jan 05 '18 at 14:15
How about using `pdf_text` and then `regexpr`, `regexec` and `regmatches` (to isolate, get the position of, and extract) specific patterns? For example, if you want to isolate 'Goat' you could look for the text found between the strings 'Source' and 'General description'. — Gautam, Jan 05 '18 at 14:41
Have to agree with @Gautam, I think that's the best option. It's a bit manual, but it will have to feature. — AodhanOL, Jan 05 '18 at 14:57
My answer below should still work - you can splice pdfs that are two columns. See the answer posted here: https://stackoverflow.com/questions/42541849/extract-text-from-two-column-pdf-with-r — Gautam, Jan 05 '18 at 18:34

Gautam · Answer 1 · 2018-01-05T15:24:29.170

Couldn't post code in comments so here's a possible approach using pdftools and regular expressions.

DATA

I used the same data that you provided and saved it to a pdf called "pdf_catalogue.pdf".

CODE

library(pdftools)
u <- pdf_text("pdf_catalogue.pdf")

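# Return the text captured by the (...) group in `pattern`,
# with carriage returns and line feeds removed.
get_string <- function(pattern, string){
  inter_list <- regmatches(string, regexec(pattern, string))
  if(length(inter_list) > 0){

    replace_patterns_list <- list("\r", "\n") # add others as required
    replace_patterns <- paste(unlist(replace_patterns_list), collapse = "|")

    # inter_list[[1]] holds c(full match, captured group) for the first
    # element of `string`; element 2 is NA when the pattern does not match
    inter_string <- gsub(replace_patterns, "", inter_list[[1]][2])
    return(inter_string)
  }
}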

pat_source <- "Source: (.*)General description"
pat_description <- "General description: (.*)Alternative"
pat_form <- "Form: (.*)Immunogen"
pat_names <- "Alternative names: (.*)Form"

dat <- list(Source = get_string(pat_source, u),
        General_description = get_string(pat_description, u), 
        Form = get_string(pat_form, u), 
        Alternative_names = get_string(pat_names, u))

The get_string function returns whatever lies between the strings before and after the (.*). This assumes that the file structure is consistent, as your question implies. Because (.*) is greedy, it runs to the last occurrence of the closing string; if that string can appear more than once, use a "lazy search" with (.*?) instead. There's an excellent video by Roger Peng explaining regular expressions if you're unfamiliar with them.

OUTPUT

> dat
$Source
[1] "Goat"

$General_description
[1] "Goat polyclonal to ERp57 - endoplasmic reticulum lumen marker.This endoplasmic reticulum protein interacts with lectin chaperones calreticulin andcalnexin to modulate folding of newly synthesized glycoproteins. It has disulfideisomerase activity and complexes of lectins and this protein mediate protein folding bypromoting formation of disulfide bonds in their glycoprotein substrates."

$Form
[1] "Polyclonal antibody supplied as a 200 µl (2 mg/ml) aliquot in PBS, 20% glycerol and 0.05% sodium azide. This antibody is epitope-affinity purified from goat antiserum."

$Alternative_names
[1] "58 kDa glucose regulated protein, 58 kDa microsomal protein,disulfide isomerase ER 60, endoplasmic reticulum resident protein 57, endoplasmicreticulum resident protein 60, ER protein 57, ER protein 60, ER protein 61, ERP57,ERp60, ERp61, glucose regulated protein 58 Kd, GRP57, GRP58, HsT17083, P58,PDIA3, phospholipase C alpha, PI PLC, protein disulfide isomerase A3 antibody."
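To illustrate the greedy versus lazy distinction mentioned above, here is a minimal sketch on a made-up one-line string (the string `txt` is hypothetical, not taken from the PDF). It also shows where an NA comes from: indexing the second element of an empty match.

```r
# Hypothetical text in which the closing word "Immunogen" appears twice
txt <- "Form: liquid. Immunogen: peptide. SICGEN's Proprietary Immunogen Policy"

# Greedy: (.*) runs up to the LAST "Immunogen"
greedy <- regmatches(txt, regexec("Form: (.*)Immunogen", txt))[[1]][2]

# Lazy: (.*?) stops at the FIRST "Immunogen"
lazy <- regmatches(txt, regexec("Form: (.*?)Immunogen", txt))[[1]][2]

# No match: regmatches() returns character(0), so element 2 is NA
no_match <- regmatches(txt, regexec("Storage: (.*)Usage", txt))[[1]][2]

print(greedy)
print(lazy)
print(no_match)
```

If a field comes back as NA, the first thing to check is whether both delimiter strings really occur, in that order, in the text being searched.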

You may want to split the output further, depending on its structure. For example, in Alternative names the names seem to be separated by commas. You could try

> strsplit(dat$Alternative_names, ", ")

which gives

[[1]]
 [1] "58 kDa glucose regulated protein"                   
 [2] "58 kDa microsomal protein,disulfide isomerase ER 60"
 [3] "endoplasmic reticulum resident protein 57"          
 [4] "endoplasmicreticulum resident protein 60"           
 [5] "ER protein 57"                                      
 [6] "ER protein 60"                                      
 [7] "ER protein 61"                                      
 [8] "ERP57,ERp60"                                        
 [9] "ERp61"                                              
[10] "glucose regulated protein 58 Kd"                    
[11] "GRP57"                                              
[12] "GRP58"                                              
[13] "HsT17083"                                           
[14] "P58,PDIA3"                                          
[15] "phospholipase C alpha"                              
[16] "PI PLC"                                             
[17] "protein disulfide isomerase A3 antibody."

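Notice that splitting on a comma followed by a space (", ") leaves some elements holding two names (elements 2, 8 and 14 above), because the line breaks that separated them were removed without leaving a space. Split on "," alone and trim the surrounding whitespace to avoid such errors. This is especially important for .pdf files, where line breaks fall in arbitrary places. You can also break running text into separate fields by defining the breaks appropriately (for example, a period followed by an upper-case letter). Regular expressions should let you address all such use cases.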
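A minimal sketch of both ideas, using shortened stand-ins for the extracted fields (`alt_names` and `desc` are hypothetical strings with the same missing spaces as the real output). The sentence split is a heuristic and would also break at abbreviations such as "E. coli".

```r
# Shortened stand-in for dat$Alternative_names
alt_names <- "58 kDa glucose regulated protein, 58 kDa microsomal protein,disulfide isomerase ER 60, ERP57,ERp60"

# Split on a comma followed by any amount of whitespace (including none)
names_vec <- trimws(strsplit(alt_names, ",[[:space:]]*")[[1]])

# Shortened stand-in for dat$General_description
desc <- "lumen marker.This protein interacts with lectin chaperones. It has disulfide isomerase activity."

# Mark each "period + optional spaces + upper-case letter" with a newline,
# then split on that newline
marked <- gsub("\\.[[:space:]]*([[:upper:]])", ".\n\\1", desc)
sentences <- strsplit(marked, "\n", fixed = TRUE)[[1]]

print(names_vec)
print(sentences)
```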

This is a rather minimal example but you can easily build on it to cover other fields/combinations you may want from the file.
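The page text in the question keeps the PDF's two-column layout, so a field in the left column is interleaved with text from the right column. One possible way to handle that, sketched here under the assumption that the right column always starts at the same character position, is to cut every line at that position and read the left column first. The names `split_columns` and `split_at` are hypothetical, and the demo page is made up; full-width lines (such as the Reactivity line) would be cut in two and need separate handling.

```r
# Cut layout-preserved page text into a left and a right column at a fixed
# character position and return the left column followed by the right one.
# `split_at` is the position of the first character of the right column
# (an assumption to verify against the real files).
split_columns <- function(page, split_at) {
  lines <- strsplit(page, "\r?\n")[[1]]
  left  <- trimws(substr(lines, 1, split_at - 1))
  right <- trimws(substr(lines, split_at, nchar(lines)))
  # Drop empty pieces, then stack left column on top of right column
  paste(c(left[nzchar(left)], right[nzchar(right)]), collapse = "\n")
}

# Made-up two-column page: left column padded to 30 characters
page <- paste(
  sprintf("%-30s%s", "Source: Goat", "Form: Polyclonal antibody"),
  sprintf("%-30s%s", "General description: Goat", "supplied in PBS."),
  sprintf("%-30s%s", "polyclonal to ERp57.", "Immunogen: Recombinant peptide"),
  sep = "\r\n"
)

single_column <- split_columns(page, split_at = 31)
cat(single_column, "\n")
```

The result can then be passed to get_string in place of the raw page text, so that patterns such as "Source: (.*)General description" no longer pick up text from the other column.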

For multiple files, I'd recommend enclosing all of this in a function (once you've finalized your code) and looping through the directory using lapply. I use something similar to go over .txt and .csv files.
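A rough sketch of that loop follows. The names `extract_fields` and `process_folder` are hypothetical, as are the two demo files and their contents; the field patterns assume the field order shown above. The demo reads plain text files so that it runs without any PDFs; for real data sheets the reader could be something like function(f) paste(pdftools::pdf_text(f), collapse = "\n").

```r
# Extract the fields of interest from one document's text.
# Returns a one-row data frame; unmatched fields become NA.
extract_fields <- function(txt) {
  grab <- function(pattern) {
    m <- regmatches(txt, regexec(pattern, txt))[[1]]
    if (length(m) < 2) return(NA_character_)
    trimws(gsub("[\r\n]+", " ", m[2]))  # line breaks become single spaces
  }
  data.frame(
    Catalogue_No = grab("Catalogue No\\. ([^[:space:]]+)"),
    Source       = grab("Source: (.*?)General description"),
    stringsAsFactors = FALSE
  )
}

# Read every matching file in `dir` with `reader`, extract its fields,
# and stack the rows into one table.
process_folder <- function(dir, pattern, reader) {
  files <- list.files(dir, pattern = pattern, full.names = TRUE)
  rows <- lapply(files, function(f) {
    out <- extract_fields(reader(f))
    out$File <- basename(f)
    out
  })
  do.call(rbind, rows)
}

# Demo on two made-up text files in a temporary folder
demo_dir <- tempfile("sheets_")
dir.create(demo_dir)
writeLines("Catalogue No. DEMO-001\nSource: Goat\nGeneral description: ...",
           file.path(demo_dir, "a.txt"))
writeLines("Catalogue No. DEMO-002\nSource: Rabbit\nGeneral description: ...",
           file.path(demo_dir, "b.txt"))

read_txt <- function(f) paste(readLines(f), collapse = "\n")
tab <- process_folder(demo_dir, "\\.txt$", read_txt)
print(tab)
```

The resulting data frame can be written out with write.csv(tab, "catalogue.csv", row.names = FALSE) if you want to open it in Excel.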

Hope this is helpful. Cheers!
