I want to read existing OCR data from .tif files using Java. The OCR data was created with MS Office Document Image Writer. I have searched for open-source libraries, but I couldn't find any library or tool that can read the attached OCR data.

How can I get this OCR data from .tif files using Java?

Yakari
  • Did you look at [this](http://stackoverflow.com/questions/1813881/java-ocr-implementation)? – home Aug 11 '11 at 08:35
  • That is not helpful enough and also not what I am looking for, but thanks. – Yakari Aug 11 '11 at 10:58
  • so you want to extract existing metadata already available in the tiff? – home Aug 11 '11 at 11:30
  • I am not sure whether this data is OCR data or metadata, but I want to extract all data attached to the .tif. I used ExifTool to extract some content data, but it retrieves the content data only from the last page if the .tif consists of more than one page. I don't know how I can retrieve more data using ExifTool. – Yakari Aug 11 '11 at 13:40

1 Answer

OCR data created with MS Office Document Image Writer, along with the other metadata, can be retrieved using ExifTool.

Example:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// cmdLineInput[0] = fully qualified path to the ExifTool executable
// cmdLineInput[1] = -ee (extract embedded) option, used to extract data
//                   from multi-page .tif files
// cmdLineInput[2] = fully qualified path to the .tif file
String[] cmdLineInput = { "C:\\ExifTool\\exif.exe", "-ee",
        "C:\\images\\example.tif" };
ProcessBuilder processBuilder = new ProcessBuilder(cmdLineInput);

try {
    Process exif = processBuilder.start();
    // try-with-resources closes the reader once the output is consumed
    try (BufferedReader brInput = new BufferedReader(new InputStreamReader(
            exif.getInputStream()))) {
        String outputLine;
        while ((outputLine = brInput.readLine()) != null) {
            System.out.println(outputLine);
        }
    }
    exif.waitFor();
} catch (IOException ioe) {
    // handle exception
} catch (InterruptedException ie) {
    // waitFor() was interrupted; restore the interrupt flag
    Thread.currentThread().interrupt();
}

You can parse the data from each outputLine and store it in an object for further handling, for example to save it in a database.
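As a rough sketch of that parsing step: by default ExifTool prints one tag per line as a name, a colon, and a value. The class name ExifOutputParser, the method parse, and the sample lines are made up for illustration. The sketch assumes that splitting at the first colon is good enough for your files, and it keeps a list of values per name because a name may repeat across pages.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of ExifTool or of the code above.
public class ExifOutputParser {

    // Collects "name : value" lines into a map that preserves first-seen order.
    // Each name maps to a list because the same name can occur more than once.
    public static Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> tags = new LinkedHashMap<>();
        for (String line : lines) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // no separator on this line, skip it
            }
            String name = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (name.isEmpty()) {
                continue; // nothing usable before the colon
            }
            tags.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
        return tags;
    }

    public static void main(String[] args) {
        // Made-up sample lines in the default "name : value" layout.
        List<String> sample = Arrays.asList(
                "File Name                       : example.tif",
                "Page Number                     : 1",
                "Page Number                     : 2");
        System.out.println(parse(sample));
    }
}
```

In the reading loop you would add each outputLine to a list instead of printing it, then pass that list to parse once the process has finished.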

CSchulz
Yakari