I have a PDF document that contains two images and some text, and I expected one of the images to be a barcode. I was able to extract the other image and the text content, but the barcode is returned neither as an image nor as text. How can I extract the barcode?

I subclassed PDFStreamEngine and overrode processOperator:

@Override
protected void processOperator( Operator operator, List<COSBase> operands) throws IOException
{
    String operation = operator.getName();
    if( "Do".equals(operation) )
    {
        COSName objectName = (COSName) operands.get( 0 );
        PDXObject xobject = getResources().getXObject( objectName );
        if( xobject instanceof PDImageXObject)
        {
            PDImageXObject image = (PDImageXObject)xobject;
            int imageWidth = image.getWidth();
            int imageHeight = image.getHeight();

            // save the image locally
            BufferedImage bImage = image.getImage();
            ImageIO.write(bImage,"PNG",new File("c:\\temp\\image_"+imageNumber+".png"));
            System.out.println("Image saved.");
            imageNumber++;

        }
        // PDTransparencyGroup is a subclass of PDFormXObject, so it must be tested first
        else if (xobject instanceof PDTransparencyGroup)
        {
            System.out.println("Transparency");
        }
        else if( xobject instanceof PDFormXObject)
        {
            System.out.println("Form Object");
        }


    }
    else
    {
        super.processOperator( operator, operands);
    }
}

Also, when I used the following code snippet, the barcode was not listed:

PDPageTree list = document.getPages();
for (PDPage page : list) {
    PDResources pdResources = page.getResources();
    for (COSName c : pdResources.getXObjectNames()) {
        PDXObject o = pdResources.getXObject(c);
        System.out.println("COS " + c.getName());
    }
}
  • Please share the PDF for analysis. Probably the bar code is not a bitmap image to start with but instead vector graphics or probably even text (with the bar code stripes as letters). Or it may be hidden in some Pattern (which usually is not inspected), or probably it is in some annotation. – mkl Jun 19 '20 at 15:41
  • Thanks for the input. For security reasons, I am not able to share the actual document. Is there any way to inspect it? If this is vector graphics, how do I extract it? – jprism Jun 19 '20 at 17:28
  • Re vector graphics: https://stackoverflow.com/questions/38931422/pdfbox-2-0-2-calling-of-pagedrawer-processpage-method-caught-exceptions – Tilman Hausherr Jun 19 '20 at 17:44
  • To get the value of the barcode, you can render the PDF to image and then feed it to ZXing. This will of course take longer. – Tilman Hausherr Jun 19 '20 at 17:45
  • I checked pdResources.getExtGStateNames() for extended graphics state resources. Nothing. – jprism Jun 19 '20 at 17:49
  • pdResources.getExtGStateNames() has almost certainly nothing to do with barcodes. – Tilman Hausherr Jun 20 '20 at 10:01
  • The problem is not about the barcode as such. One image, the barcode, is not being extracted as a PDXObject. – jprism Jun 20 '20 at 13:10
  • Try the ExtractImages command line tool. That is the gold standard. Does your barcode get extracted there? If not, how do you know your barcode is an image? – Tilman Hausherr Jun 22 '20 at 04:59
  • I don't know. How can I tell? That is one of the challenges. – jprism Jun 22 '20 at 16:43
  • See the initial comments. Not all barcodes in PDF files in this world are images. They could be vector graphics. You can't share the file, so all you can do is to follow the suggestions given. Here: run the ExtractImages command line tool. – Tilman Hausherr Jun 23 '20 at 10:12
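
One rough way to approach the "how do I know?" question from the comments is to look at the page's decoded content stream and count which kinds of painting operators occur. The sketch below is only an illustration under stated assumptions: it is plain Java, not PDFBox API, the class name `ContentStreamTally` and its `tally` method are hypothetical, and it assumes you already have the decoded content stream as text (for example copied from a PDF inspection tool). It splits on whitespace only, so operator-like tokens inside string literals or inline image data would be miscounted. Many `re` and `f` operators with few or no `Do` operators would suggest the bars are drawn as vector rectangles; note that `Do` also invokes form XObjects, not only images.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of PDFBox): crude tally of painting operators. */
final class ContentStreamTally {

    // Operators of interest: XObject invocation, inline image start,
    // rectangle path construction, fill/stroke, and text showing.
    private static final String[] OPERATORS =
            {"Do", "BI", "re", "f", "f*", "S", "B", "Tj", "TJ"};

    /**
     * Counts how often each operator of interest appears as a
     * whitespace-separated token in an already decoded content stream.
     * This is a simplification, not a real content stream parser.
     */
    static Map<String, Integer> tally(String decodedContent) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String op : OPERATORS) {
            counts.put(op, 0);
        }
        for (String token : decodedContent.trim().split("\\s+")) {
            Integer current = counts.get(token);
            if (current != null) {
                counts.put(token, current + 1);
            }
        }
        return counts;
    }
}
```

For example, calling `ContentStreamTally.tally(...)` on a stream that contains one `/Im0 Do` and dozens of `re f` pairs would point towards one bitmap image plus a vector-drawn barcode, in which case rendering the page and decoding the result, as suggested in the comments, is the more promising route.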

0 Answers