
Problem Statement: I am unable to read data from a PDF file using SAS.

What worked well: I am able to download the PDF from the website and save it.

Not working (Need Help): I cannot read the data from the PDF file using SAS. The structure of the source content is expected to stay the same. The expected output is attached as a JPG image.

It would be a great help if someone could show me how to tackle this scenario with a SAS program. The image below shows the source in PDF format and the expected result as a SAS dataset:

I tried something like this:

/*Proxy address*/
%let proxy_host=xxx.com;
%let port=123;

/*Output location*/
filename output "/desktop/Response.pdf";

/*Download the source file and save it in the desired location*/
proc http           
url="https://cdn.nar.realtor/sites/default/files/documents/ehs-10-2020-overview-2020-11-19_0.pdf"       
method="get"        
proxyhost="&proxy_host."        
proxyport=&port         
out=output;     
run;

%let lineSize = 2000;

data base;
   format text_line $&lineSize..;
   infile output lrecl=&lineSize;
   input text_line $;
run;

DATA _NULL_ ;
X "PS2ASCII /desktop/Response.pdf
/desktop/flatfile.txt";
RUN;
mrivanlima
anil kumar
  • Do you have SAS Text Analytics? I think that's the only tool with built-in functionality to extract this information. However, a simple and really good alternative is to use Adobe's Save to Excel (or text) and then extract the information. If your table is well structured as shown and always has the same format, that would probably work consistently. If you have Adobe Pro you can save to Excel; with the regular version you'll have to use the text approach. – Reeza Dec 10 '20 at 20:10
  • It looks like you've tried piping it to the text file, what happened there? Is the text file not readable now? – Reeza Dec 10 '20 at 20:10
  • There is no need to place an `X` command inside of a DATA step. But you might want to use the PIPE filename engine so your data step could read any error messages that PS2ASCII might emit. Does PS2ASCII actually work on a PDF file? – Tom Dec 10 '20 at 22:51
  • There are several open source tools out there that can read and interpret PDFs depending on the internal structure, including OCR (which sometimes is all you can do). Your first step should be to get it into an xlsx, csv, or txt format. It looks like the structure will make it a viable option with a variety of tools. Once you do that, you can read it in SAS like any other raw file. I have also seen some interesting white papers that talk about reading uncompressed PDFs directly within SAS, such as this pdf2sas macro: https://support.sas.com/resources/papers/proceedings16/9320-2016.pdf – Stu Sztukowski Dec 11 '20 at 01:45
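The PIPE suggestion in the comments above could look roughly like the following. This is a minimal sketch, not a tested conversion: it assumes the `PS2ASCII` command from the question is installed on the SAS server, that it accepts an input and an output path as written in the question, that the session is allowed to run operating system commands (the XCMD option), and that the host shell is Unix-like so `2>&1` folds error output into the pipe. The fileref name `conv` is made up for illustration.

```sas
/* Hypothetical fileref: run the converter through a pipe instead of X,   */
/* so anything it prints (including errors) can be read by a DATA step.   */
filename conv pipe "PS2ASCII /desktop/Response.pdf /desktop/flatfile.txt 2>&1";

data _null_;
  infile conv truncover;
  input;
  putlog _infile_;   /* echo each line the command emitted to the SAS log */
run;

filename conv clear;
```

If the log shows nothing and `/desktop/flatfile.txt` exists, the conversion ran without complaint; any message that does appear is the converter's own output.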

1 Answer


You can use the Apache PDFBox® library, an open source Java tool for working with PDF documents. The library can be used from within SAS Proc GROOVY with Java code that strips the text and its position on the page from a PDF document.

Example:

You will have to write more code to make a data set from the stripped text.

filename overview "overview.pdf";
filename ov_text  "overview.txt";

* download a pdf document;

proc http           
url="https://cdn.nar.realtor/sites/default/files/documents/ehs-10-2020-overview-2020-11-19_0.pdf"       
method="get"        
/*proxyhost="&proxy_host."        */
/*proxyport=&port         */
out=overview;     
run;

* download the Apache PDFBox library (a .jar file); 

filename jar 'pdfbox.jar';

%if %sysfunc(FEXIST(jar)) ne 1 %then %do;
  proc http
    url='https://www.apache.org/dyn/closer.lua?filename=pdfbox/2.0.21/pdfbox-app-2.0.21.jar&action=download'
    out=jar;
  run;
%end;

* Use GROOVY to read the PDF, strip out the text and position, and write that
* parse to a text file which SAS can read;

proc groovy classpath="pdfbox.jar"; 
  submit 
    "%sysfunc(pathname(overview))"  /* the input, a pdf file */
    "%sysfunc(pathname(ov_text))"   /* the output, a text file */
  ;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import java.io.FileWriter;
import java.io.PrintWriter;

public class GetLinesFromPDF extends PDFTextStripper {
    
    static List<String> lines = new ArrayList<String>();
    public GetLinesFromPDF() throws IOException {
    }
    /**
     * @throws IOException If there is an error parsing the document.
     */
    public static void main( String[] args ) throws IOException {
        PDDocument document = null;
        PrintWriter out = null;
        String inPdf = args[0];
        String outTxt = args[1];

        try {
            document = PDDocument.load( new File(inPdf) );

            PDFTextStripper stripper = new GetLinesFromPDF();

            stripper.setSortByPosition( true );
            stripper.setStartPage( 1 );  // PDFBox page numbers are 1-based
            stripper.setEndPage( document.getNumberOfPages() );

            Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
            stripper.writeText(document, dummy);
            
            out = new PrintWriter(new FileWriter(outTxt));

            // print lines to text file
            for(String line:lines){
              out.println(line); 
            }
        }
        finally {
            if( document != null ) {
                document.close();
            }
            if( out != null ) {
                out.close();
            }
        }
    }
    /**
     * Override the default functionality of PDFTextStripper.writeString()
     */
    @Override
    protected void writeString(String str, List<TextPosition> textPositions) throws IOException {
        String places = "";

        for(TextPosition tp:textPositions){
          places += "(" + tp.getX() + "," + tp.getY() + ") ";
        }

        lines.add(str + " found @ " + places);
    }
}

  endsubmit;
quit;

* preview the stripped text that was saved;

data _null_;
  infile ov_text;
  input;
  putlog _infile_;
run;

/*
 * additional SAS code will be needed to input the text as data 
 * and construct a data set that matches the original tabular content layout
 */
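As a starting point for that additional code, here is a minimal sketch of reading the stripped text back into a data set. It assumes the fileref `ov_text` from above is still assigned and that every line has the `<text> found @ (x,y) (x,y) ...` shape written by `writeString`. The data set and variable names (`pdf_lines`, `pdf_sorted`, `text`, `x`, `y`, `line_no`) are made up for illustration, and how PDFBox splits the text into runs depends on the document.

```sas
data pdf_lines;
  infile ov_text lrecl=32767 truncover;
  input;

  length text $200 first_pair $64;
  line_no = _n_;

  /* Locate the separator written by the Groovy code */
  marker = find(_infile_, ' found @ ');
  if marker < 2 then delete;     /* no separator, or no text before it */

  text = substrn(_infile_, 1, marker - 1);

  /* Keep only the first (x,y) pair: the position where the text run starts */
  first_pair = scan(substrn(_infile_, marker + 9), 1, '() ');
  x = input(scan(first_pair, 1, ','), ?? best32.);
  y = input(scan(first_pair, 2, ','), ?? best32.);

  drop marker first_pair;
run;

/* Runs with the same y sit on one row of the page; x orders them left to right */
proc sort data=pdf_lines out=pdf_sorted;
  by y x;
run;

proc print data=pdf_sorted(obs=20);
run;
```

From here, grouping rows by `y` and assigning columns by ranges of `x` is one way to rebuild the table; the ranges have to be read off the actual output.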
Richard
  • The comment got clipped. What was the error? What version of SAS are you using? (Help/About... Software Information) The code shown was tested in "SAS 9.4 TS Level 1M6". – Richard Dec 13 '20 at 19:44
  • Thank you for the quick turnaround! I tried your suggestion, but I got an error while downloading the library: ERROR: Insufficient authorization to access /opt/sas/94/comp/config/Lev1/SASApp/pdfbox.jar. ERROR: Generic HTTP Client Error. My organization is rather conservative and blocks some websites, so I am not able to download the library. Should I contact my system admin about this? Please advise. Thank you. – anil kumar Dec 13 '20 at 19:45
  • The path in the message shows you are running SAS code on a Unix SAS server host through an EG workstation session, a stored process, or a Studio login. It looks like you were able to download the .jar, but it is in an unreadable place. Try again, but specify an explicit path in `filename jar`. And yes, it's always best practice to inform the sys admin when you need to use capabilities that might brush up against their policies. Also, you might find Proc GROOVY locked down, in which case you need to tell them you need the Proc for work. – Richard Dec 14 '20 at 00:59
  • Now I am able to download the jar file, but Proc Groovy gives an error, for example: org.codehaus.groovy.runtime.InvokerInvocationException: java.io.FileNotFoundException: (No such file or directory) at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:105) ERROR: The SUBMIT command failed. – anil kumar Dec 14 '20 at 10:23
  • You have to make sure the `filename ov_text` is pointing to a directory and file you can write to. `filename ov_text "/tmp/overview.txt"` might work. Or even `filename ov_text temp`. The temp file would be removed after the session ends, or become session orphaned or deleted if the same filename statement was reissued. – Richard Dec 14 '20 at 15:19
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/225951/discussion-between-anil-hagargi-and-richard). – anil kumar Dec 14 '20 at 19:41
  • Your code works superbly; I could extract the PDF content in text format. But can you please help me with parsing the data? – anil kumar Dec 15 '20 at 08:22
  • Proc Groovy stopped working. I assume it is because of the new Apache PDFBox version 2.0.22? Do we have to change the code for the new version? When I run the code, the error is: ERROR: The SUBMIT command failed. org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed: Script10.groovy: 2: unable to resolve class org.apache.pdfbox.pdmodel.PDDocument – anil kumar Jan 15 '21 at 07:28
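The path suggestions in the comments above can be combined into one setup. The sketch below rests on several assumptions: `/tmp` is only a placeholder for a directory you can write to, and the `archive.apache.org` URL is an assumed location for the pinned 2.0.21 release (Apache mirrors generally carry only current releases, which is one possible reason the original download link stops returning a usable jar). The final step only checks that the PDFBox class resolves from the jar, which is the class named in the last error message.

```sas
/* Hypothetical writable directory; replace with a path you own */
%let workdir = /tmp;

filename jar     "&workdir./pdfbox-app-2.0.21.jar";
filename ov_text temp;   /* session-scoped output file, as suggested above */

/* Assumed archive URL for the pinned release; download only if missing */
%if %sysfunc(fexist(jar)) ne 1 %then %do;
  proc http
    url="https://archive.apache.org/dist/pdfbox/2.0.21/pdfbox-app-2.0.21.jar"
    out=jar;
  run;
%end;

/* Point the classpath at the same physical file the fileref resolves to */
proc groovy classpath="%sysfunc(pathname(jar))";
  submit;
    import org.apache.pdfbox.pdmodel.PDDocument
    println PDDocument.class.getName()
  endsubmit;
quit;
```

Using `%sysfunc(pathname(jar))` for the classpath keeps the download location and the classpath in step, so a relative `pdfbox.jar` is never resolved against the server's configuration directory.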