Extract text from PDF documents and generate structured data

Question

I can successfully extract the text from all pages of the PDF, but I am unable to turn it into structured data. Any guidance from someone with experience in this would be appreciated.

Code:

package pdfboxreadfromfile;

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFBoxReadFromFile {
  public static void main(String[] args) {
    File file = new File("C:/ma.pdf");
    // try-with-resources closes the document even if text extraction throws
    try (PDDocument doc = PDDocument.load(file)) {
      PDFTextStripper pdfTextStripper = new PDFTextStripper();
      // Order the text by its position on the page instead of content-stream order
      pdfTextStripper.setSortByPosition(true);
      pdfTextStripper.setStartPage(1);
      pdfTextStripper.setEndPage(6);
      String text = pdfTextStripper.getText(doc);
      System.out.println(text);
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}

Output:

PDF looks like this. Page 1:

The expected header text is for reference only and need not be printed.

Tried the following:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A trailing lazy quantifier such as ".*?" or "3*?" matches zero characters,
// so each pattern captures a fixed-width slice starting at its prefix.
Pattern p = Pattern.compile("PO...........*?");
Pattern p1 = Pattern.compile("Vendor...........");
Pattern p2 = Pattern.compile("100.....*?");
Pattern p4 = Pattern.compile("Date...............................................*?");
Pattern p5 = Pattern.compile("62...........3*?");
Pattern p6 = Pattern.compile("62710149950...*?");
Pattern p7 = Pattern.compile("627101499504..*?");

Matcher m = p.matcher(text);
Matcher m1 = p1.matcher(text);
Matcher m2 = p2.matcher(text);
Matcher m4 = p4.matcher(text);
Matcher m5 = p5.matcher(text);
Matcher m6 = p6.matcher(text);
Matcher m7 = p7.matcher(text);

// group(0) throws IllegalStateException unless find() returned true, so check every match first.
if (m.find() && m1.find() && m2.find() && m4.find() && m5.find() && m6.find() && m7.find()) {
  // Fields shared by all three output rows
  String common = m.group(0) + "|" + m1.group(0) + "|" + m2.group(0) + "|" + m2.group(0) + "|" + "MAC" + "|" + m4.group(0) + "|";
  System.out.println(common + m5.group(0) + "|");
  System.out.println(common + m6.group(0) + "|");
  System.out.println(common + m7.group(0) + "|");
} else {
  System.err.println("At least one pattern did not match the extracted text");
}

Structured output is produced, but the quantity for each barcode (i.e. product code) is missing.
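The attempt above needs one hand-written pattern per barcode. A minimal sketch of an alternative is a single pattern with capture groups that is applied repeatedly, so every row yields its barcode together with its quantity. The row layout here is an assumption, since the real PDF is not shown: each row is taken to start with a 13-digit barcode and end with an integer quantity on the same line. The class name, the sample barcodes and the `PO-HYPOTHETICAL` prefix are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BarcodeRowRegex {
    // ASSUMPTION: a row starts with a 13-digit barcode and ends with an integer quantity.
    // Group 1 = barcode, group 2 = last number on the same line.
    private static final Pattern ROW =
            Pattern.compile("^(\\d{13})\\b.*?[ \\t]+(\\d+)[ \\t]*$", Pattern.MULTILINE);

    // Returns one "prefix|barcode|quantity|" string per matching line.
    static List<String> extract(String text, String prefix) {
        List<String> out = new ArrayList<>();
        Matcher m = ROW.matcher(text);
        while (m.find()) {  // loop over every matching row, not just the first
            out.add(prefix + "|" + m.group(1) + "|" + m.group(2) + "|");
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical stripper output; the barcodes are invented sample values.
        String text = "Barcode Description Qty\n"
                + "6271014995011 Sample item A 12\n"
                + "6271014995028 Sample item B 3\n";
        extract(text, "PO-HYPOTHETICAL").forEach(System.out::println);
    }
}
```

Rows whose quantity is printed on the following line will not match this pattern; they would have to be merged with their barcode line before the pattern is applied.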

iText and PDFBox are general-purpose PDF libraries, not specialized table data extractors. Other products use these libraries as the base for a specialized table data extraction feature. You may want to try such products, e.g. [pdf2Data](https://itextpdf.com/en/products/itext-7/pdf2data) and [tabula](https://tabula.technology/). — mkl, May 29 '20 at 20:57
@Leace I think you have to keep parsing the PDF text and skip lines until you reach the part you want. That is the only solution here, because you are extracting plain text, which does not distinguish between pieces of information. — Karam Mohamed, May 30 '20 at 00:56
Based on the reviews, I can see that Textricator is a tool to extract text from documents and generate structured data. Has anyone used this tool? Please share what you know. — Leace, May 30 '20 at 09:50
I suggest removing all newline characters to form a single paragraph, then splitting the contents. Store the pieces in a list of objects and pass the values along when you design your custom PDF. — Natsu, May 31 '20 at 19:59
@Natsu Do you have a link to a sample of what you suggest? — Leace, May 31 '20 at 20:10
@Leace I have worked on bulk text data extraction and on printing the results into formatted PDFs, so I personally follow that approach. You can do it easily. Sorry, I don't have any links to help you. — Natsu, May 31 '20 at 20:19
@Leace As I suggested earlier, try removing all line breaks so that everything is in a single paragraph. First identify all the fixed values, such as address, telephone and column names. Then use those as boundaries: take the text between two fixed values, trim it, and store it in your custom-defined object. After that you can easily pass the values anywhere in the PDF. — Natsu, May 31 '20 at 20:26
@Natsu Trimming between two fixed values is also a problem here, because each product's description is in two languages. For some product codes the quantity comes out exactly right when trimming between two fixed values, whereas for a product code like "6271014995024" the quantity does not come out properly because the value is one step down. How can such a situation be handled? — Leace, May 31 '20 at 21:22
@Leace I can't provide an exact solution without seeing the data. But as I understand it, if by "one step down" you mean a line break, you can remove the line breaks from the text before starting the splitting process. Also, don't think of this as processing a single file. You are free to create separate files for convenience, for example splitting out the header part and the column data separately, and then splitting each again by the relevant field. — Natsu, May 31 '20 at 21:58
@Leace Also, if you are dealing with data in different languages, try to find a field that uniquely identifies the language. If you find one, you can use it for a conditional split that handles the multiple-language case, something like `if this language is identified, split the data with this method, otherwise with that method` — Natsu, May 31 '20 at 22:01
I have never tried tabula-java, but there is a good Python equivalent, [camelot-py](https://github.com/atlanhq/camelot), with more leniency and control in parsing the data. — ExtractTable.com, Jun 03 '20 at 12:55
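The boundary approach Natsu describes in the comments can be sketched roughly as follows: collapse all line breaks into single spaces, then take the trimmed text between two fixed labels. The label strings (`PO Number:`, `Vendor:`, `Date:`) and the sample values are assumptions for illustration only, because the actual PDF text is not shown in the question. The class and method names are hypothetical as well.

```java
class LabelBoundaries {
    // Collapse every run of whitespace, including line breaks, into one space.
    static String flatten(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    // Returns the trimmed text between two fixed labels,
    // or null when either label cannot be found.
    static String between(String flat, String startLabel, String endLabel) {
        int start = flat.indexOf(startLabel);
        if (start < 0) {
            return null;
        }
        start += startLabel.length();
        int end = flat.indexOf(endLabel, start);  // search only after the start label
        if (end < 0) {
            return null;
        }
        return flat.substring(start, end).trim();
    }

    public static void main(String[] args) {
        // Hypothetical header text; labels and values are invented.
        String raw = "PO Number: 4500012345\nVendor:\n 100234\nDate: 29.05.2020";
        String flat = flatten(raw);
        System.out.println(between(flat, "PO Number:", "Vendor:"));
        System.out.println(between(flat, "Vendor:", "Date:"));
    }
}
```

This works best for header fields that have stable labels on both sides. As Leace points out, table rows with two-language descriptions have no such fixed boundary, so they probably need line-based parsing instead.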

score 0 · Answer 1 · answered Jun 01 '20 at 13:58

Search the text for the header line (Barcode, Item number, ...) and then parse each following line by splitting it into columns. The columns are separated by spaces, so you can use the String.split() function, e.g. line.split("\\s+") to split on any run of whitespace.
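A minimal sketch of this idea, under stated assumptions: the header line starts with a known word (`Barcode` here), every row starts with a 13-digit barcode, and a line that does not start with a barcode is a continuation of the previous row (the "one step down" quantity from the comments). The class name, column layout and sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class TableLineParser {
    // Skips everything up to the header line, merges wrapped lines into their
    // row, then splits each row into columns on whitespace.
    static List<String[]> parseRows(String text, String headerStart) {
        List<String> merged = new ArrayList<>();
        boolean inTable = false;
        for (String line : text.split("\\R")) {  // \R matches any line break
            String trimmed = line.trim();
            if (!inTable) {
                inTable = trimmed.startsWith(headerStart);
                continue;
            }
            if (trimmed.isEmpty()) {
                continue;
            }
            if (trimmed.matches("\\d{13}\\b.*")) {
                merged.add(trimmed);              // ASSUMPTION: a barcode starts a new row
            } else if (!merged.isEmpty()) {
                int last = merged.size() - 1;     // continuation line: append to previous row
                merged.set(last, merged.get(last) + " " + trimmed);
            }
        }
        List<String[]> rows = new ArrayList<>();
        for (String row : merged) {
            rows.add(row.split("\\s+"));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical extracted text; the second row's quantity wraps to the next line.
        String text = "Intro\nBarcode Item Description Qty\n"
                + "6271014995011 1001 Widget 12\n"
                + "6271014995028 1002 Gadget\n3\n";
        for (String[] cols : parseRows(text, "Barcode")) {
            System.out.println(cols[0] + "|" + cols[cols.length - 1]);
        }
    }
}
```

Descriptions that contain spaces change the column count, so reading the barcode as the first token and the quantity as the last token is likely more robust than relying on fixed column indexes. Footer lines after the table would be appended to the last row by this sketch and would need a stop condition of their own.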

majster

