
I can extract the text from all pages of the PDF successfully, but I am unable to turn it into structured data. Please guide me if anyone has experience with this.

Code:

package pdfboxreadfromfile;

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFBoxReadFromFile {
  public static void main(String[] args) {
    // try-with-resources closes the document even if extraction fails
    try (PDDocument doc = PDDocument.load(new File("C:/ma.pdf"))) {
      PDFTextStripper pdfTextStripper = new PDFTextStripper();
      // Emit the text in visual order (top to bottom, left to right)
      pdfTextStripper.setSortByPosition(true);
      pdfTextStripper.setStartPage(1);
      pdfTextStripper.setEndPage(6);
      String text = pdfTextStripper.getText(doc);
      System.out.println(text);
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}

Output:

[image]

The PDF looks like this (page 1): [image]

The expected header text is only for reference and need not be printed. [image]

Tried the following:

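import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Each pattern is a literal prefix followed by a fixed number of wildcard characters.
// A trailing lazy quantifier (*?) matches as little as possible, so with find() it adds nothing to the match.
Pattern p = Pattern.compile("PO...........*?");
Pattern p1 = Pattern.compile("Vendor...........");
Pattern p2 = Pattern.compile("100.....*?");
Pattern p4 = Pattern.compile("Date...............................................*?");
Pattern p5 = Pattern.compile("62...........3*?");
Pattern p6 = Pattern.compile("62710149950...*?");
Pattern p7 = Pattern.compile("627101499504..*?");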

Matcher m = p.matcher(text);
Matcher m1 = p1.matcher(text);
Matcher m2 = p2.matcher(text);
Matcher m4 = p4.matcher(text);
Matcher m5 = p5.matcher(text);
Matcher m6 = p6.matcher(text);
Matcher m7 = p7.matcher(text);
m.find();
m1.find();
m2.find();
m4.find();
m5.find();
m6.find();
m7.find();

System.out.println(m.group(0) + "|" + m1.group(0) + "|" + m2.group(0) + "|" + m2.group(0) + "|" + "MAC" + "|" + m4.group(0) + "|" + m5.group(0) + "|");
System.out.println(m.group(0) + "|" + m1.group(0) + "|" + m2.group(0) + "|" + m2.group(0) + "|" + "MAC" + "|" + m4.group(0) + "|" + m6.group(0) + "|");
System.out.println(m.group(0) + "|" + m1.group(0) + "|" + m2.group(0) + "|" + m2.group(0) + "|" + "MAC" + "|" + m4.group(0) + "|" + m7.group(0) + "|");
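
One thing to watch in the snippet above: calling group(0) after a find() that returned false throws IllegalStateException. A minimal sketch of a guarded lookup follows; the class name SafeMatch, the helper firstMatch, and the sample string in main are hypothetical and not part of the original code.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SafeMatch {

    // Returns the first match of regex in text, or an empty Optional if there is none.
    static Optional<String> firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Invented sample text, only for illustration.
        String text = "Vendor 100234 Date 29.05.2020";
        // A missing field becomes an empty column instead of an exception.
        System.out.println(firstMatch("Vendor.{7}", text).orElse("")
                + "|" + firstMatch("PO.{5}", text).orElse("") + "|");
    }
}
```

With this, a field that is absent on a given page shows up as an empty column rather than aborting the whole run.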

This produces structured output, but the quantity for each barcode (product code) is missing. [image]

khelwood
Leace
  • Can you please show us the PDF file? – Karam Mohamed May 29 '20 at 17:44
  • @KaramMohamed attached pdf page 1 contents and view – Leace May 29 '20 at 19:08
  • iText and PDFBox are general-purpose PDF libraries, not specialized table data extractors. Other products use these libraries as the base for a specialized table data extraction feature. You may want to try such products, e.g. [pdf2Data](https://itextpdf.com/en/products/itext-7/pdf2data) and [tabula](https://tabula.technology/). – mkl May 29 '20 at 20:57
  • @Leace I think you have to keep parsing your PDF text and ignoring lines until you reach what you want. That's the only solution here because you're extracting plain text, which doesn't differentiate between pieces of information. – Karam Mohamed May 30 '20 at 00:56
  • Is it possible to achieve this using a Java ArrayList? – Leace May 30 '20 at 06:48
  • Based on reviews, I can see Textricator is a tool to extract text from documents and generate structured data. Has anyone used this tool? Please share your experience. – Leace May 30 '20 at 09:50
  • I suggest removing all newline characters to make a single paragraph, then splitting the contents. Store the results in a list of objects and pass the values in when you design your custom PDF. – Natsu May 31 '20 at 19:59
  • @Natsu do you have a link to a sample of your suggestion? – Leace May 31 '20 at 20:10
  • @Leace I have worked with bulk text data extraction and printing the results into formatted PDFs, so I personally follow that approach. You can easily do that. I don't have any links to help you, sorry. – Natsu May 31 '20 at 20:19
  • @Leace As I suggested earlier, try removing all line breaks so that everything is in a single paragraph. First identify all the fixed values like address, telephone and column names. Then use those as boundaries: take the text between two fixed values, trim it, and store it in your custom-defined object. Then you can easily pass the values wherever you need them in the PDF. – Natsu May 31 '20 at 20:26
  • @Natsu Trimming between two fixed values is also a problem here because each product's description is in two languages. For some product codes the qty comes out exactly right by trimming between two fixed values, whereas for a product code like "6271014995024" the qty does not come out properly since the value is one step down. How should such a situation be handled? – Leace May 31 '20 at 21:22
  • @Leace I can't provide the exact solution without seeing the data, but as I understand it, if "one step down" means a line break, you can remove the line breaks from the text before starting the splitting process. Also, don't think of it as processing a single file. You have the flexibility to create separate files for your convenience, such as splitting out the header part and the column data separately, then splitting again using the relevant field. – Natsu May 31 '20 at 21:58
  • @Leace Also, if you are dealing with data in different languages, try to find a field that uniquely identifies the language. If you find one, you can use it for conditional separation to handle the multiple-language situations, something like `if this language identified then use this method to split the data, or else this method` – Natsu May 31 '20 at 22:01
  • I have never tried tabula-java, but there is a good Python equivalent, [camelot-py](https://github.com/atlanhq/camelot), with more leniency and control in parsing the data. – ExtractTable.com Jun 03 '20 at 12:55
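
The "one step down" case from the comments (a quantity that wraps onto the line below its barcode) might be handled by gluing continuation lines back onto the row they belong to before splitting. The following is only a sketch under a loud assumption: every product row starts with a 13-digit barcode, as in the "6271014995024" example. The names RecordJoiner and joinWrappedRows and the sample text are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class RecordJoiner {

    // Assumption: a product row begins with exactly 13 digits (the barcode).
    private static final Pattern ROW_START = Pattern.compile("^\\d{13}\\b.*");

    // Merges each wrapped line into the preceding barcode row, one record per product.
    static List<String> joinWrappedRows(String text) {
        List<String> records = new ArrayList<>();
        StringBuilder current = null;
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            if (ROW_START.matcher(trimmed).matches()) {
                // A new barcode closes the previous record.
                if (current != null) {
                    records.add(current.toString());
                }
                current = new StringBuilder(trimmed);
            } else if (current != null) {
                // Continuation line: append it to the open record.
                current.append(' ').append(trimmed);
            }
            // Lines before the first barcode (headers) are ignored.
        }
        if (current != null) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        // Invented sample text, only for illustration.
        String text = "Barcode Description Qty\n6271014995024 Widget\n10\n6271014995031 Gadget 3";
        joinWrappedRows(text).forEach(System.out::println);
    }
}
```

Any footer text that follows the table would be appended to the last record, so in practice the input should be cut off at a known end marker first.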

1 Answer


You should search the text for the header line (Barcode, Item number, ...) and then parse each following line by splitting it into columns. The columns are separated by spaces, so you can use the String.split() function.
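
A minimal sketch of that idea, assuming the header line can be recognised by the word "Barcode" and that no column value contains spaces (a multi-word description would be split into several tokens). The names TableRowParser and parseRows and the sample text are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class TableRowParser {

    // Returns the rows that follow the first line containing headerMarker,
    // each split into columns on runs of whitespace.
    static List<String[]> parseRows(String text, String headerMarker) {
        List<String[]> rows = new ArrayList<>();
        boolean inTable = false;
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (!inTable) {
                // Skip everything up to and including the header line.
                inTable = trimmed.contains(headerMarker);
                continue;
            }
            if (trimmed.isEmpty()) {
                continue;
            }
            rows.add(trimmed.split("\\s+"));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Invented sample text, only for illustration.
        String text = String.join("\n",
                "PO 12345",
                "Barcode Item Qty",
                "1111111111111 A100 5",
                "2222222222222 B200 12");
        for (String[] row : parseRows(text, "Barcode")) {
            System.out.println(String.join("|", row));
        }
    }
}
```

Splitting on "\\s+" rather than a single space tolerates the variable padding that PDFTextStripper tends to emit between columns.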

majster