
In my application, I will receive a file. I have to check whether the file contains searchable text (actual text content) or only non-searchable content (images), and display which it is.

I cannot rely on the file extension, because a PDF file can also be of the non-searchable kind.

I need Java code for this. Can anyone help me, please?

user1332962
  • 1
    I think this link can help you: http://stackoverflow.com/q/620993/1001027 – Francisco Spaeth Jun 09 '12 at 00:39
  • 1
    In the case of PDF files, you'd have to actually open the file and examine its structure to see what sort of data it contains. Same goes for other file types, such as Word documents. This is a significant amount of work: you have to actually implement support for each file format you want your program to understand. There's no magic `File.containsSearchableData()` method. – Wyzard Jun 09 '12 at 00:42

2 Answers


A practical solution to this problem involves working out the MIME type of each unknown file from its content. You would then need to build a mapping from MIME types to the classes that extract text from the corresponding file type.
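As a rough illustration of the second part, here is a minimal sketch of such a mapping using only the JDK. The class name `TextExtractorRegistry` and its methods are made up for this example, the UTF-8 decoding of plain text is an assumption, and a real PDF or Word extractor would have to come from a library for that format. The MIME type is taken as an input here; detecting it is a separate step.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical registry: maps a MIME type to a function that extracts text from raw bytes.
final class TextExtractorRegistry {
    private final Map<String, Function<byte[], String>> extractors = new HashMap<>();

    void register(String mimeType, Function<byte[], String> extractor) {
        extractors.put(mimeType, extractor);
    }

    // Returns the extracted text, or "" when no extractor is registered for the type.
    String extract(String mimeType, byte[] content) {
        Function<byte[], String> extractor = extractors.get(mimeType);
        return extractor == null ? "" : extractor.apply(content);
    }

    // A file counts as "searchable" here if extraction yields any non-whitespace text.
    boolean hasSearchableText(String mimeType, byte[] content) {
        return !extract(mimeType, content).trim().isEmpty();
    }

    static TextExtractorRegistry withDefaults() {
        TextExtractorRegistry registry = new TextExtractorRegistry();
        // Plain text: decode as UTF-8 (an assumption; real files may use other encodings).
        registry.register("text/plain", bytes -> new String(bytes, StandardCharsets.UTF_8));
        // Images carry no text layer, so their extractors return nothing.
        registry.register("image/jpeg", bytes -> "");
        registry.register("image/gif", bytes -> "");
        // An "application/pdf" extractor backed by a PDF library would be registered here.
        return registry;
    }
}
```

With this in place, `TextExtractorRegistry.withDefaults().hasSearchableText(mimeType, bytes)` gives the yes/no answer for any type that has an extractor, and unknown types are simply reported as not searchable.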

There are libraries for doing the first part (identifying MIME types), though this is a heuristic process, and can (in theory) return the wrong answer or (in practice) "unknown". Here is a sample of SO questions and other references on how to do this:

Stephen C

This lies in the area of data mining and search engines (Lucene). There are many converters (pdftotext, htmltotext, unzip, etc.). Character encoding also plays a role: UTF-16LE, for example, uses at least two bytes per character. Some file types have identifying headers, or magic cookies (JPEG, GIF, PDF).
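To show what checking for magic cookies can look like, here is a small sketch that inspects the first bytes of a file for the three formats mentioned. The class `MagicSniffer` is a made-up name for this example, and it only checks signatures at offset 0, which is a simplification; it says nothing about whether a PDF actually contains text.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: guesses a MIME type from the leading bytes of a file.
final class MagicSniffer {
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] JPEG = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF};
    private static final byte[] GIF87 = "GIF87a".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] GIF89 = "GIF89a".getBytes(StandardCharsets.US_ASCII);

    // Returns the guessed type, or "application/octet-stream" if no signature matches.
    static String sniff(byte[] head) {
        if (startsWith(head, PDF)) return "application/pdf";
        if (startsWith(head, JPEG)) return "image/jpeg";
        if (startsWith(head, GIF87) || startsWith(head, GIF89)) return "image/gif";
        return "application/octet-stream";
    }

    // True if data begins with every byte of prefix, in order.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

Reading the first dozen or so bytes of the file and passing them to `MagicSniffer.sniff` is enough for these signatures; anything unrecognized falls through to the generic type.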

It is best to search online for projects that suit your needs, and then add features incrementally once you have designed a functioning pipeline.

If you need a design, the dead standard for data mining, JDM 2.0, might offer an API.

Joop Eggen