
I have an EBCDIC file from which I extracted images. Some data on the images is the key source for identifying my transactions. Assume I have an image, say the Stack Overflow logo, stored as "img1.jpg" on my desktop. When I read it using the following code, it works:

String inputImage = "C:\\Desktop\\img1.jpg";
File imageFile = new File(inputImage);
BufferedImage image1 = ImageIO.read(imageFile);
System.out.println(image1);

However, when I attempt the same with an image decoded from the EBCDIC conversion, `ImageIO.read` returns null.

The difference I observed is that the decoded image has no color information. Is there any way to read these images and retrieve the text on them? The following is not the exact image I am working on; it is a sample from the internet, shared just to give an idea. Note: the image I am working on looks like a scanned (grayscale) image. Example: [image]

I also observed that if I open the decoded file, take a screen capture with the Snipping Tool, save that as a JPG file (the original is already a JPG), and read it, the system reads that file. I am not sure where the issue is: compression, color coding, or format.
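One way to narrow down "compression, color coding, or format" is to look at the first bytes of the decoded file, because a file extension says nothing about the actual container. The sketch below compares the leading bytes against the well-known JPEG, TIFF, and PNG signatures. The class name `ImageFormatSniffer` and the default path are illustrative assumptions, not part of the original code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper: guesses an image format from its leading signature bytes.
public class ImageFormatSniffer {

    /** Returns a best-guess format name, or "unknown" if no signature matches. */
    public static String sniff(byte[] head) {
        if (startsWith(head, 0xFF, 0xD8, 0xFF)) return "JPEG";
        if (startsWith(head, 0x49, 0x49, 0x2A, 0x00)) return "TIFF (little-endian)";
        if (startsWith(head, 0x4D, 0x4D, 0x00, 0x2A)) return "TIFF (big-endian)";
        if (startsWith(head, 0x89, 0x50, 0x4E, 0x47)) return "PNG";
        return "unknown";
    }

    // Compares the first bytes of data (as unsigned values) with the signature.
    private static boolean startsWith(byte[] data, int... signature) {
        if (data.length < signature.length) return false;
        for (int i = 0; i < signature.length; i++) {
            if ((data[i] & 0xFF) != signature[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Assumed default path; pass the decoded file as the first argument instead.
        Path path = Paths.get(args.length > 0 ? args[0] : "C:\\Desktop\\img1.jpg");
        if (!Files.isRegularFile(path)) {
            System.out.println("File not found: " + path);
            return;
        }
        byte[] head = new byte[8];
        int n;
        try (InputStream in = Files.newInputStream(path)) {
            n = in.read(head);
        }
        // Only pass the bytes that were actually read.
        System.out.println(path + " looks like: " + sniff(Arrays.copyOf(head, Math.max(n, 0))));
    }
}
```

If the result is not "JPEG" for a file named `.jpg`, the null from `ImageIO.read` is more likely a format problem than a color problem.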

  • What is an EBCDIC file for you and how did you extract an image from it? From what I understand EBCDIC is a text encoding, alternative to ASCII https://en.wikipedia.org/wiki/EBCDIC – Joni Aug 11 '20 at 18:47
  • EBCDIC is a text formatting. Images are stored in binary format. These are totally different formats. What makes you think the file is in EBCDIC formatting? – NomadMaker Aug 11 '20 at 18:50
  • @NomadMaker the file I receive is encoded in EBCDIC format and has both text and binary content. The text is a kind of metadata for the binary image. It is actually a file that contains the image of a cheque and its related data. I extracted the two separately. – Codester2020 Aug 11 '20 at 18:57
  • Standard Java implementations don't have support for the EBCDIC character set, so if the text embedded in the image data is EBCDIC, Java cannot decode it, and simply ignores it, leaving the decoded `BufferedImage` without properties. --- You can always do OCR scanning of the image to get the text in the image itself. – Andreas Aug 11 '20 at 19:22
  • @Andreas but the data I receive is from an external vendor and we don't have control over their process. All we have is the image shared in the encrypted file in Cp037 encoding. Looks like am out of option here then. Thank you – Codester2020 Aug 11 '20 at 19:30
  • *"Looks like am out of option"* Seems you didn't read my comment in full, so let me repeat in bold so you can see it: **You can always do OCR scanning.** – Andreas Aug 11 '20 at 19:31
  • You said you have been able to extract the images though? What is the file format of the image? – Joni Aug 11 '20 at 19:31
  • @Joni TIFF format, and I am using the jai.codec library for decoding. – Codester2020 Aug 11 '20 at 22:04
  • TIFF images have been a problem in the past and I don't know what the current state is. According to https://stackoverflow.com/questions/1954685/cant-read-and-write-a-tiff-image-file-using-java-imageio-standard-library the standard Java distribution supports TIFF since Java 9, so you don't need jai.codec if you have a modern Java version. Have you given it a try, or are you stuck with Java 8 or older? – Joni Aug 11 '20 at 22:12
  • @Joni yes, I am working with Java 8. I reached out to my infra team for an update; I hope to get it. – Codester2020 Aug 11 '20 at 23:55
  • @Andreas thank you, I am trying OCR with Tess4j, but it will take another 2 days for me to get full system access. – Codester2020 Aug 11 '20 at 23:56
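
The Java 8 versus Java 9 point from the comments can be checked directly on whatever JVM is in use. `ImageIO.read` returns null when no registered reader claims the stream, so listing the registered reader formats shows whether TIFF is among them. This is a minimal sketch using only the standard `javax.imageio` API; the class name `ImageIoReaderCheck` is an illustrative assumption.

```java
import java.util.Arrays;
import java.util.TreeSet;
import javax.imageio.ImageIO;

// Hypothetical helper: reports which image formats this JVM's ImageIO can read.
public class ImageIoReaderCheck {

    /** True if at least one ImageIO reader is registered for the given format name. */
    public static boolean hasReader(String formatName) {
        return ImageIO.getImageReadersByFormatName(formatName).hasNext();
    }

    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        // Sorted and de-duplicated list of every format name with a registered reader.
        System.out.println("Registered reader formats: "
                + new TreeSet<>(Arrays.asList(ImageIO.getReaderFormatNames())));
        System.out.println("TIFF reader available: " + hasReader("tiff"));
    }
}
```

If the last line prints `false`, a plain `ImageIO.read` on a TIFF file is expected to return null on that JVM unless a TIFF plugin is added to the classpath.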

1 Answer


Thank you, everyone. I used Tess4j to decode the TIFF image. Unfortunately, the information I was looking for isn't available in the decoded text, but the POC is done. I used the following library and added eng.traineddata to the folder where the images are stored:

import java.io.File;
import net.sourceforge.tess4j.*;

String inputImage = "C:\\Desktop\\img1.tiff";
File imageFile = new File(inputImage);
ITesseract imageRead = new Tesseract();
// Folder that contains eng.traineddata
imageRead.setDatapath("C:\\Desktop\\");
imageRead.setLanguage("eng");
// doOCR throws TesseractException, so the enclosing method must declare or handle it
String imageText = imageRead.doOCR(imageFile);
System.out.println(imageText);
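
Separately from OCR, the text portion of the vendor file is described in the comments as Cp037-encoded. Many full JDK installations ship an `IBM037` charset (alias `Cp037`) among the extended charsets, but that is an assumption about the installation, so the sketch below checks for it at runtime. The class name `Cp037Decode` and the sample bytes are illustrative; they are not taken from the vendor file.

```java
import java.nio.charset.Charset;

// Hypothetical helper: decodes EBCDIC (code page 037) bytes into a Java String.
public class Cp037Decode {

    /** Decodes the bytes as IBM037; throws UnsupportedCharsetException if the charset is absent. */
    public static String decode(byte[] ebcdicBytes) {
        Charset cp037 = Charset.forName("IBM037");
        return new String(ebcdicBytes, cp037);
    }

    public static void main(String[] args) {
        if (!Charset.isSupported("IBM037")) {
            System.out.println("IBM037 is not available on this JVM");
            return;
        }
        // EBCDIC code page 037 bytes for the letters H, E, L, L, O.
        byte[] sample = {(byte) 0xC8, (byte) 0xC5, (byte) 0xD3, (byte) 0xD3, (byte) 0xD6};
        System.out.println(decode(sample));
    }
}
```

This applies only to the text fields. The binary image bytes should be written out unchanged, since passing them through any charset conversion would corrupt them.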