How to read a table inside a PDF and print it in HTML format or in the console?

Question

I need to read a table inside a PDF file and print it, in HTML format or in the console, exactly as it appears in the PDF. I have sample code that reads the text inside the table, but I need to read the table column-wise and row-wise and print it as seen in the image. I use the PDFBox JAR. Refer to this sample image.

import org.pdfbox.cos.COSDocument;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.pdmodel.PDDocumentInformation;
import org.pdfbox.util.PDFTextStripper;
import java.io.File;
import java.io.FileInputStream;
import java.io.PrintWriter;
public class PDFTextParser {
PDFParser parser;
String parsedText;
PDFTextStripper pdfStripper;
PDDocument pdDoc;
COSDocument cosDoc;
PDDocumentInformation pdDocInfo;
// PDFTextParser Constructor 
public PDFTextParser() {
}
// Extract text from PDF Document
String pdftoText(String fileName) {
System.out.println("Parsing text from PDF file " +  fileName +"....");     
    File f = new File(fileName);
    if (!f.isFile()) {
        System.out.println("File " + fileName + " does not exist.");
        return null;
    }
    try {
        parser = new PDFParser(new FileInputStream(f));
    } catch (Exception e) {
        System.out.println("Unable to open PDF Parser.");
        return null;
    }
    try {
        parser.parse();
        cosDoc = parser.getDocument();
        pdfStripper = new PDFTextStripper();
        pdDoc = new PDDocument(cosDoc);
        parsedText = pdfStripper.getText(pdDoc); 
    } catch (Exception e) {
        System.out.println("An exception occurred in parsing the PDF Document.");
        e.printStackTrace();
        try {
            if (cosDoc != null) cosDoc.close();
            if (pdDoc != null) pdDoc.close();
        } catch (Exception e1) {
            // Report the failure to close, not the original parse error again
            e1.printStackTrace();
        }
        return null;
    }      
    System.out.println("Done.");
    return parsedText;
}
// Write the parsed text from PDF to a file
void writeTexttoFile(String pdfText, String fileName) {
    System.out.println("\nWriting PDF text to output text file " + fileName + "....");
    try {
        PrintWriter pw = new PrintWriter(fileName);
        pw.print(pdfText);
        pw.close();     
    } catch (Exception e) {
        System.out.println("An exception occurred in writing the pdf text to file.");
        e.printStackTrace();
    }
    System.out.println("Done.");
}
public static void main(String args[]) {
    String fileList[] = {"SO115638.pdf","New_Document.txt"};
    if (fileList.length != 2) {
        System.out.println("Usage: java PDFTextParser <InputPDFFilename> <OutputTextFile>");
        System.exit(1);
    }
    PDFTextParser pdfTextParserObj = new PDFTextParser();
    String pdfToText = pdfTextParserObj.pdftoText(fileList[0]);
    if (pdfToText == null) {
        System.out.println("PDF to Text Conversion failed.");
    }
    else {
        System.out.println("\nThe text parsed from the PDF Document....\n" + pdfToText);
        pdfTextParserObj.writeTexttoFile(pdfToText, fileList[1]);
    }
}
}

I don't think that is really possible. There isn't anything analogous to a table in pdf. It is just text with instructions on where to place that text, so it is likely going to be very difficult to figure out what is a table. The pdf doesn't contain anything saying "this is a row" and "this is a column". — Matthew, Feb 17 '16 at 06:16
The OP of [the question "How to find table border lines in pdf using PDFBox?"](http://stackoverflow.com/q/35409283/1729265) has a similar problem. — mkl, Feb 17 '16 at 10:34
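
As a rough sketch of what Matthew's comment implies: since a PDF stores only text plus coordinates, a table can sometimes be approximated by grouping fragments whose positions line up. The code below is an illustration only and is not PDFBox API. `TableSketch`, `Fragment`, and the tolerance value are hypothetical names, and it assumes that the fragment coordinates have already been obtained somehow and that y increases down the page. It uses only the Java standard library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustration only: arranges positioned text fragments into a grid. */
final class TableSketch {

    /** Hypothetical holder for one piece of text and its position on the page. */
    static final class Fragment {
        final String text;
        final float x;
        final float y;

        Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Sorts the values and keeps one anchor per run of values lying within tolerance of it. */
    static List<Float> cluster(List<Float> values, float tolerance) {
        List<Float> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        List<Float> anchors = new ArrayList<>();
        for (float v : sorted) {
            if (anchors.isEmpty() || v - anchors.get(anchors.size() - 1) > tolerance) {
                anchors.add(v);
            }
        }
        return anchors;
    }

    /** Index of the greatest anchor that is not larger than v. */
    static int slot(List<Float> anchors, float v) {
        int idx = 0;
        for (int i = 0; i < anchors.size(); i++) {
            if (anchors.get(i) <= v) {
                idx = i;
            }
        }
        return idx;
    }

    /** Builds a rows-by-columns grid: similar y values share a row, similar x values a column. */
    static String[][] toGrid(List<Fragment> fragments, float tolerance) {
        List<Float> xs = new ArrayList<>();
        List<Float> ys = new ArrayList<>();
        for (Fragment f : fragments) {
            xs.add(f.x);
            ys.add(f.y);
        }
        List<Float> cols = cluster(xs, tolerance);
        List<Float> rows = cluster(ys, tolerance);
        String[][] grid = new String[rows.size()][cols.size()];
        for (String[] row : grid) {
            Arrays.fill(row, "");
        }
        for (Fragment f : fragments) {
            int r = slot(rows, f.y);
            int c = slot(cols, f.x);
            // Fragments that land in the same cell are joined with a space
            grid[r][c] = grid[r][c].isEmpty() ? f.text : grid[r][c] + " " + f.text;
        }
        return grid;
    }

    /** Renders the grid as a minimal HTML table, escaping markup characters. */
    static String toHtml(String[][] grid) {
        StringBuilder sb = new StringBuilder("<table>\n");
        for (String[] row : grid) {
            sb.append("  <tr>");
            for (String cell : row) {
                String safe = cell.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
                sb.append("<td>").append(safe).append("</td>");
            }
            sb.append("</tr>\n");
        }
        return sb.append("</table>").toString();
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for text positions read from a PDF
        List<Fragment> fragments = new ArrayList<>();
        fragments.add(new Fragment("Item", 50f, 100f));
        fragments.add(new Fragment("Qty", 200f, 100f));
        fragments.add(new Fragment("Bolt", 50f, 120.4f));
        fragments.add(new Fragment("12", 200.5f, 120f));
        System.out.println(toHtml(toGrid(fragments, 2f)));
    }
}
```

For the four sample fragments in `main`, this produces a table with two rows and two columns. Real PDFs with merged cells, wrapped text, or uneven spacing would need more careful grouping than this.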

Ataur Rahman Munna · Answer 1 · 2016-02-17T06:38:11.043

This needs only a small edit to your code: instead of a bare file name, give the absolute path of each file. For example:

String fileList[] = {"E:\\JavaApplication14\\src\\javaapplication14\\p10071.pdf", "E:\\JavaApplication14\\src\\javaapplication14\\newTextDocument.txt"};
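
If it is unclear why the bare file name fails, a relative name is resolved against the directory the JVM was started in, which may not be where the PDF lives. Here is a minimal sketch for checking that, using only the standard library; `PathCheck` and `resolve` are hypothetical names for illustration, and the default file name is the one from the question.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Illustration only: shows where a relative file name actually points. */
final class PathCheck {

    /** Resolves a possibly relative file name against the current working directory. */
    static Path resolve(String fileName) {
        return Paths.get(fileName).toAbsolutePath().normalize();
    }

    public static void main(String[] args) {
        Path pdf = resolve(args.length > 0 ? args[0] : "SO115638.pdf");
        System.out.println("Looking for: " + pdf);
        System.out.println("Is a regular file: " + Files.isRegularFile(pdf));
    }
}
```

If the second line prints `false`, either move the PDF to the printed location or pass the full path as in the example above.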

Add these JARs to your project's classpath:

commons-logging-1.1.1.jar
fontbox-1.4.0.jar
pdfbox-1.2.0.jar

The complete code looks like this (before running it, change the class name and the constructor name so that they match):

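// Package names match the Apache PDFBox 1.x JARs listed above;
// the org.pdfbox.* imports in the question belong to older, pre-Apache releases.
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
import java.io.File;
import java.io.FileInputStream;
import java.io.PrintWriter;

public class PDFTextParser {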

    PDFParser parser;
    String parsedText;
    PDFTextStripper pdfStripper;
    PDDocument pdDoc;
    COSDocument cosDoc;
//    PDDocumentInformation pdDocInfo;
// PDFTextParser Constructor 
    public PDFTextParser() {
    }
// Extract text from PDF Document
    String pdftoText(String fileName) {
        System.out.println("Parsing text from PDF file " + fileName + "....");
        File f = new File(fileName);
        if (!f.isFile()) {
            System.out.println("File " + fileName + " does not exist.");
            return null;
        }
        try {
            parser = new PDFParser(new FileInputStream(f));
        } catch (Exception e) {
            System.out.println("Unable to open PDF Parser.");
            return null;
        }
        try {
            parser.parse();
            cosDoc = parser.getDocument();
            pdfStripper = new PDFTextStripper();
            pdDoc = new PDDocument(cosDoc);
            parsedText = pdfStripper.getText(pdDoc);
        } catch (Exception e) {
            System.out.println("An exception occurred in parsing the PDF Document.");
            e.printStackTrace();
            try {
                if (cosDoc != null) {
                    cosDoc.close();
                }
                if (pdDoc != null) {
                    pdDoc.close();
                }
            } catch (Exception e1) {
                e1.printStackTrace();
            }
            return null;
        }
        System.out.println("Done.");
        return parsedText;
    }
// Write the parsed text from PDF to a file
    void writeTexttoFile(String pdfText, String fileName) {
        System.out.println("\nWriting PDF text to output text file " + fileName + "....");
        try {
            PrintWriter pw = new PrintWriter(fileName);
            pw.print(pdfText);
            pw.close();
        } catch (Exception e) {
            System.out.println("An exception occurred in writing the pdf text to file.");
            e.printStackTrace();
        }
        System.out.println("Done.");
    }

    public static void main(String args[]) {
        String fileList[] = {"E:\\JavaApplication14\\src\\javaapplication14\\p10071.pdf", "E:\\JavaApplication14\\src\\javaapplication14\\newTextDocument.txt"};
        if (fileList.length != 2) {
            System.out.println("Usage: java PDFTextParser <InputPDFFilename> <OutputTextFile>");
            System.exit(1);
        }
        PDFTextParser pdfTextParserObj = new PDFTextParser();
        String pdfToText = pdfTextParserObj.pdftoText(fileList[0]);
        if (pdfToText == null) {
            System.out.println("PDF to Text Conversion failed.");
        } else {
            System.out.println("\nThe text parsed from the PDF Document....\n" + pdfToText);
            pdfTextParserObj.writeTexttoFile(pdfToText, fileList[1]);
        }
    }
}
