I'm facing the following use case:

I receive one PDF that contains many documents. Each document has a different number of pages. The documents are separated by a barcode page.

Is it possible to split a multipage PDF containing several documents that are separated by a page with a barcode, and create new PDFs, one for each document?

I read that we can split a PDF with iText: https://developers.itextpdf.com/examples/stamping-content-existing-pdfs/clone-splitting-pdf-file

But I can't find anything on the web about how to split it when I detect a barcode page.

UPDATE: @mkl I have found how to read the text from a QR code with zxing. It works with a simple PNG file:

File QRfile = new File("test.png");

BufferedImage bufferedImg = ImageIO.read(QRfile);
LuminanceSource source = new BufferedImageLuminanceSource(bufferedImg);
BinaryBitmap bitmap = new BinaryBitmap(new HybridBinarizer(source));

Result result = new MultiFormatReader().decode(bitmap);

System.out.println("Barcode Format: " + result.getBarcodeFormat());
System.out.println("Content: " + result.getText());

But it doesn't work in a loop. I tested with a PDF document (7 pages).

Here is the Java code:

PdfDocument pdfDoc;
pdfDoc = new PdfDocument(new PdfReader(pathName));
logger.debug("pdfDoc OK"); 
PdfDocumentContentParser contentParser = new PdfDocumentContentParser(pdfDoc);
for (int page = 1; page <= pdfDoc.getNumberOfPages(); page++)
{
    logger.debug("page: " + page); 
    contentParser.processContent(page, new IEventListener()
    {
        @Override
        public Set<EventType> getSupportedEvents()
        {
            logger.debug("inside getSupportedEvents"); 
            return Collections.singleton(RENDER_IMAGE);
        }

        @Override
        public void eventOccurred(IEventData data, EventType type)
        {
            index = index + 1;
            logger.debug("inside eventOccurred - data: " + data);
            logger.debug("inside eventOccurred - type: " + type);
            logger.debug("inside eventOccurred - index: " + index);
            if (data instanceof ImageRenderInfo)
            {
                logger.debug("data instanceof ImageRenderInfo"); 
                ImageRenderInfo imageRenderInfo = (ImageRenderInfo) data;
                byte[] bytes = imageRenderInfo.getImage().getImageBytes();
                try
                {
                    logger.debug("avant Files writer");
                    String pngName = "C:/alfresco/klinck/splitImage-" + index + ".png";
                    logger.debug("pngName: " + pngName);
                    Files.write(new File(pngName).toPath(), bytes);
                    logger.debug("Files written");
                    File QRfile = new File(pngName);
                    logger.debug("QR File trouvé ! ");
                    BufferedImage bufferedImg = ImageIO.read(QRfile);
                    logger.debug("bufferedImg OK ");
                    LuminanceSource source = new BufferedImageLuminanceSource(bufferedImg);
                    logger.debug("source OK ");
                    BinaryBitmap bitmap = new BinaryBitmap(new HybridBinarizer(source));
                    logger.debug("bitmap OK");
                    Result result = new MultiFormatReader().decode(bitmap);
                    logger.debug("SplitFluxJobExcecuter - resultBarcodeFormat: " + result.getBarcodeFormat());
                    logger.debug("SplitFluxJobExcecuter - result.getText(): " + result.getText());
                }catch (Exception e)
                {
                   logger.error("SplitJobExecuter Exception : " + ExceptionUtils.getStackTrace(e));
                }
            }
        }
        int index = 0;
    });
}

The first page contains 3 images (one of them is the QR code). I get "com.google.zxing.NotFoundException" during the last step.

This is the log:

2018-07-25 16:27:00,227 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] pdfDoc OK
2018-07-25 16:27:00,227 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] page: 1
2018-07-25 16:27:00,237 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] inside getSupportedEvents

2018-07-25 16:27:00,265 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] inside eventOccurred - data: com.itextpdf.kernel.pdf.canvas.parser.data.ImageRenderInfo@2472ac79
2018-07-25 16:27:00,266 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] inside eventOccurred - type: RENDER_IMAGE
2018-07-25 16:27:00,266 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] inside eventOccurred - index: 1
2018-07-25 16:27:00,266 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] data instanceof ImageRenderInfo
2018-07-25 16:27:00,266 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] avant Files writer
2018-07-25 16:27:00,266 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] pngName: C:/alfresco/klinck/splitImage-1.png
2018-07-25 16:27:00,270 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] Files written
2018-07-25 16:27:00,270 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] QR File trouvé ! 
2018-07-25 16:27:00,304 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] bufferedImg OK 
2018-07-25 16:27:00,305 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] source OK 
2018-07-25 16:27:00,306 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] bitmap OK
2018-07-25 16:27:00,407 ERROR [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] SplitJobExecuter Exception : com.google.zxing.NotFoundException

2018-07-25 16:27:00,407 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] inside eventOccurred - data: com.itextpdf.kernel.pdf.canvas.parser.data.ImageRenderInfo@6e036aea
2018-07-25 16:27:00,407 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] inside eventOccurred - type: RENDER_IMAGE
2018-07-25 16:27:00,407 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] inside eventOccurred - index: 2
2018-07-25 16:27:00,407 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] data instanceof ImageRenderInfo
2018-07-25 16:27:00,408 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] avant Files writer
2018-07-25 16:27:00,408 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] pngName: C:/alfresco/klinck/splitImage-2.png
2018-07-25 16:27:00,411 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] Files written
2018-07-25 16:27:00,411 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] QR File trouvé ! 
2018-07-25 16:27:00,415 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] bufferedImg OK 
2018-07-25 16:27:00,415 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] source OK 
2018-07-25 16:27:00,415 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] bitmap OK
2018-07-25 16:27:00,473 ERROR [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] SplitJobExecuter Exception : com.google.zxing.NotFoundException

2018-07-25 16:27:00,474 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] inside eventOccurred - data: com.itextpdf.kernel.pdf.canvas.parser.data.ImageRenderInfo@4c205db7
2018-07-25 16:27:00,474 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] inside eventOccurred - type: RENDER_IMAGE
2018-07-25 16:27:00,474 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] inside eventOccurred - index: 3
2018-07-25 16:27:00,474 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] data instanceof ImageRenderInfo
2018-07-25 16:27:00,474 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] avant Files writer
2018-07-25 16:27:00,474 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] pngName: C:/alfresco/klinck/splitImage-3.png
2018-07-25 16:27:00,478 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] Files written
2018-07-25 16:27:00,478 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] QR File trouvé ! 
2018-07-25 16:27:00,479 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] bufferedImg OK 
2018-07-25 16:27:00,479 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] source OK 
2018-07-25 16:27:00,479 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] bitmap OK
2018-07-25 16:27:00,484 ERROR [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] SplitJobExecuter Exception : com.google.zxing.NotFoundException

From page 2 to page 7, the error message is different:

2018-07-25 16:27:00,487 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] page: 2
2018-07-25 16:27:00,488 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] inside getSupportedEvents
2018-07-25 16:27:00,488 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] inside eventOccurred - data: com.itextpdf.kernel.pdf.canvas.parser.data.ImageRenderInfo@6d41ffa2
2018-07-25 16:27:00,488 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] inside eventOccurred - type: RENDER_IMAGE
2018-07-25 16:27:00,488 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] inside eventOccurred - index: 1
2018-07-25 16:27:00,489 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] data instanceof ImageRenderInfo
2018-07-25 16:27:00,489 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] avant Files writer
2018-07-25 16:27:00,489 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] pngName: C:/alfresco/klinck/splitImage-1.png
2018-07-25 16:27:00,492 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] Files written
2018-07-25 16:27:00,493 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] QR File trouvé ! 
2018-07-25 16:27:00,493 DEBUG [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] bufferedImg OK 
2018-07-25 16:27:00,493 ERROR [com.klinck.mc.jobs.SplitFluxJobExecuter] [schedulerSplit_Worker-1] SplitJobExecuter Exception : java.lang.NullPointerException
    at com.google.zxing.client.j2se.BufferedImageLuminanceSource.<init>(BufferedImageLuminanceSource.java:42)
    at com.klinck.mc.jobs.SplitFluxJobExecuter$1.eventOccurred(SplitFluxJobExecuter.java:150)
    at com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor.eventOccurred(PdfCanvasProcessor.java:534)
    at com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor.displayImage(PdfCanvasProcessor.java:573)
    at com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor.access$5800(PdfCanvasProcessor.java:108)
    at com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor$ImageXObjectDoHandler.handleXObject(PdfCanvasProcessor.java:1420)
    at com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor.displayXObject(PdfCanvasProcessor.java:566)
    at com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor.access$5600(PdfCanvasProcessor.java:108)
    at com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor$DoOperator.invoke(PdfCanvasProcessor.java:1285)
    at com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor.invokeOperator(PdfCanvasProcessor.java:452)
    at com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor.processContent(PdfCanvasProcessor.java:281)
    at com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor.processPageContent(PdfCanvasProcessor.java:302)
    at com.itextpdf.kernel.pdf.canvas.parser.PdfDocumentContentParser.processContent(PdfDocumentContentParser.java:77)
    at com.itextpdf.kernel.pdf.canvas.parser.PdfDocumentContentParser.processContent(PdfDocumentContentParser.java:90)
    at com.klinck.mc.jobs.SplitFluxJobExecuter.execute(SplitFluxJobExecuter.java:118)
    at com.klinck.mc.jobs.SplitFluxJob$1.doWork(SplitFluxJob.java:27)
    at org.alfresco.repo.security.authentication.AuthenticationUtil.runAs(AuthenticationUtil.java:555)
    at com.klinck.mc.jobs.SplitFluxJob.executeJob(SplitFluxJob.java:24)
    at org.alfresco.schedule.ScheduledJobLockExecuter.execute(ScheduledJobLockExecuter.java:94)
    at org.alfresco.schedule.AbstractScheduledLockedJob.executeInternal(AbstractScheduledLockedJob.java:72)
    at org.springframework.scheduling.quartz.QuartzJobBean.execute(QuartzJobBean.java:114)
    at org.quartz.core.JobRunShell.run(JobRunShell.java:216)
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:563)

UPDATE 2

I think the error message "com.google.zxing.NotFoundException" appears because the images don't contain a text message or are too large; see the question "com.google.zxing.NotFoundException exception comes when core java program executed?".

anakin59490
  • How can your bar codes be recognized? – mkl Jul 23 '18 at 16:02
  • update : we will use QR code instead of bar codes – anakin59490 Jul 24 '18 at 08:13
  • The obvious way would be to search for the QR codes in the PDF (if they are embedded as bitmaps, simply extract all images from the PDF and scan them using e.g. zxing). Then extract the page ranges either using a `PdfCopy` instance or (if it is important to keep document level material) a `PdfStamper` after restricting the `PdfReader` using its `selectPages` method. Is this the information you are after or is your problem actually something else? – mkl Jul 24 '18 at 09:46
  • @mkl: I have updated my post to explain what I have done (step 1: retrieve the images, step 2: read the QR code) – anakin59490 Jul 25 '18 at 15:02
  • Can you share a sample PDF with your kind of bar codes / QR codes? I ask because they might not be contained as bitmap images at all, or they might be contained in a way that iText by default does not trigger an event for while parsing. In that case the images found might be different ones, so not finding the bar codes / QR codes in them might be correct... because they do not contain them to start with. – mkl Jul 25 '18 at 16:09
  • @mkl: I ran tests with a PDF document that has only one page containing the right QR code, and the text is retrieved correctly. Then I generated a bigger PDF with 14 pages, two of which contain the right QR code. I now handle only these QR codes and it works fine. Thank you again! – anakin59490 Jul 26 '18 at 15:02

1 Answer

It works for me with the following method:

Step 1: detect the specific QR code and store its page number in a list

PdfDocument pdfDoc = new PdfDocument(new PdfReader(pathName));
logger.debug("pdfDoc OK");
PdfDocumentContentParser contentParser = new PdfDocumentContentParser(pdfDoc);
List<Integer> pageList = new ArrayList<Integer>();
// One-element array so that the anonymous listener can read the current page number
int[] currentPage = new int[1];
for (int page = 1; page <= pdfDoc.getNumberOfPages(); page++) {
    currentPage[0] = page;
    contentParser.processContent(page, new IEventListener() {
        // Number of image events seen on this page (a new listener is created for each page)
        int index = 0;

        @Override
        public Set<EventType> getSupportedEvents() {
            return Collections.singleton(RENDER_IMAGE);
        }

        @Override
        public void eventOccurred(IEventData data, EventType type) {
            index = index + 1;
            if (data instanceof ImageRenderInfo) {
                logger.debug("data instanceof ImageRenderInfo");
                ImageRenderInfo imageRenderInfo = (ImageRenderInfo) data;
                byte[] bytes = imageRenderInfo.getImage().getImageBytes();
                String pngName = coreServices.getSplitFolderTemp() + "Page-" + currentPage[0] + "_Image-" + index + ".png";
                logger.debug("pngName: " + pngName);
                File image = new File(pngName);
                try {
                    // The Klinck QR code is stored in the first image of the separator sheet.
                    if (index == 1) {
                        // ZXing: read the data from the QR code
                        Files.write(image.toPath(), bytes);
                        BufferedImage bufferedImg = ImageIO.read(image);
                        LuminanceSource source = new BufferedImageLuminanceSource(bufferedImg);
                        BinaryBitmap bitmap = new BinaryBitmap(new HybridBinarizer(source));
                        Result result = new MultiFormatReader().decode(bitmap);
                        // BarcodeFormat is com.google.zxing.BarcodeFormat
                        if (result.getBarcodeFormat() == BarcodeFormat.QR_CODE && "SEPARATEUR".equals(result.getText())) {
                            // Remember the page numbers of the Klinck QR codes
                            pageList.add(currentPage[0]);
                            logger.debug("Klinck QR code found on page: " + currentPage[0]);
                        }
                    }
                } catch (Exception e) {
                    logger.error("The detected image is not the Klinck QR code: " + ExceptionUtils.getStackTrace(e));
                }
                if (image.delete()) {
                    logger.debug("image deleted");
                }
            }
        }
    });
}
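
One possible refinement, offered as a sketch rather than as part of the method above: the NullPointerException in the question's stack trace is consistent with ImageIO.read returning null, which it does when no registered reader recognises the bytes. The stdlib-only example below decodes the bytes in memory, without a temporary file, and reports that case as an empty Optional. The names InMemoryImageDecoder, decode and samplePng are hypothetical; in the listener, the bytes would come from imageRenderInfo.getImage().getImageBytes().

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Optional;
import javax.imageio.ImageIO;

// Hypothetical helper, not part of iText or ZXing.
final class InMemoryImageDecoder {

    /** Decodes image bytes in memory; empty if ImageIO cannot decode them. */
    static Optional<BufferedImage> decode(byte[] bytes) {
        try {
            // ImageIO.read returns null (it does not throw) when no registered
            // reader recognises the data, so the result is wrapped null-safely.
            return Optional.ofNullable(ImageIO.read(new ByteArrayInputStream(bytes)));
        } catch (IOException e) {
            // Recognised format but unreadable content: treated as "no image" here.
            return Optional.empty();
        }
    }

    /** Builds a tiny valid PNG in memory, used only for the demonstration. */
    static byte[] samplePng() {
        try {
            BufferedImage img = new BufferedImage(2, 2, BufferedImage.TYPE_INT_RGB);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            ImageIO.write(img, "png", out);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(decode(samplePng()).isPresent());            // a real PNG
        System.out.println(decode(new byte[] {1, 2, 3, 4}).isPresent()); // not an image
    }
}
```

With such a helper, an image that cannot be decoded could simply be skipped before it reaches BufferedImageLuminanceSource.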

Step 2: create the PDFs

logger.debug("Creating the PDFs");
if (pageList.size() == 0) {
    logger.debug("a single document");
    PdfDocument pdfDest = new PdfDocument(new PdfWriter("C:/alfresco/klinck/onePdf.pdf"));
    pdfDoc.copyPagesTo(1, pdfDoc.getNumberOfPages(), pdfDest);
    pdfDest.close();
} else {
    // 2) One or more QR codes = at least two documents
    logger.debug("list length: " + pageList.size());
    int start = 1;
    for (int index = 0; index < pageList.size(); index++) {
        logger.debug("Klinck QR code found on page " + pageList.get(index));
        logger.debug("Next document, pages " + start + " to " + (pageList.get(index) - 1));
        // The first page of the original document must not be a separator
        if (pageList.get(index) != 1) {
            PdfDocument pdfDest = new PdfDocument(new PdfWriter("C:/alfresco/klinck/splitPdf-" + start + ".pdf"));
            pdfDoc.copyPagesTo(start, pageList.get(index) - 1, pdfDest);
            pdfDest.close();
        }
        start = pageList.get(index) + 1;
    }

    // Handle the last document
    PdfDocument pdfDest = new PdfDocument(new PdfWriter("C:/alfresco/klinck/splitPdf-" + start + ".pdf"));
    pdfDoc.copyPagesTo(start, pdfDoc.getNumberOfPages(), pdfDest);
    pdfDest.close();
}

pdfDoc.close();
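
The range arithmetic of Step 2 can also be isolated from iText and checked on its own. The sketch below is a variant under stated assumptions, not the code above: SeparatorRanges and documentRanges are hypothetical names, the separator list is assumed to be sorted in ascending order, and empty ranges (a separator on the first or last page, or two consecutive separators) are skipped instead of producing an output file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: computes which pages belong to which output document.
final class SeparatorRanges {

    /**
     * Returns inclusive, 1-based {firstPage, lastPage} pairs for the documents
     * lying between separator pages. Separator pages belong to no range.
     * Assumes separatorPages is sorted in ascending order.
     */
    static List<int[]> documentRanges(int pageCount, List<Integer> separatorPages) {
        List<int[]> ranges = new ArrayList<>();
        int start = 1;
        for (int separator : separatorPages) {
            int end = separator - 1;
            if (end >= start) {
                // Only keep ranges that contain at least one page
                ranges.add(new int[] {start, end});
            }
            start = separator + 1;
        }
        if (start <= pageCount) {
            // Pages after the last separator form the final document
            ranges.add(new int[] {start, pageCount});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // 14 pages with separators on pages 1, 6, 7 and 14
        for (int[] range : documentRanges(14, Arrays.asList(1, 6, 7, 14))) {
            System.out.println(range[0] + "-" + range[1]);
        }
        // prints "2-5" then "8-13"
    }
}
```

Each returned pair could then be passed to pdfDoc.copyPagesTo(range[0], range[1], pdfDest), as in the loop above.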
anakin59490