I am working with text files, images and documents (.log, .txt, .pdf, .doc, .docx, .jpeg, .jpg, .png, .tiff etc.). I need to get some metadata from these files based on their content, not their extensions. So, my questions are:

Q1. How can I differentiate between these categories of files (plain text files, text documents (.docx), PDFs, images) using Java?

Q2. Is there a Java library that would be helpful in this process?

Q3. Do PDFs containing scanned images differ from PDFs containing text in any of their properties, or in any other way?

PS: I don't have much expertise in this area, so kindly correct me if anything in my questions is wrong.

Stephen C
GadaaDhaariGeek

2 Answers

You can use a library such as Apache Tika to detect the MIME type. It analyses the file's binary data to determine the type.

PDFs are detected from their first few bytes, which are %PDF. If you want more information about the metadata, you could use something like Apache PDFBox, which allows you to retrieve it (see: https://pdfbox.apache.org/1.8/cookbook/workingwithmetadata.html)
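To illustrate what "detected from the first few bytes" means, here is a minimal sketch that uses only the JDK. It is not how Tika is implemented, only the same basic idea in miniature. The class name MagicSniffer, the method names and the category labels it returns are all made up for this example. The signatures it checks are the well-known ones for PDF, PNG, JPEG, TIFF, ZIP and OLE2. Plain text has no signature, so the fallback is a rough heuristic and will misclassify UTF-16 text, for instance.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper: classifies a file by its leading "magic" bytes.
public class MagicSniffer {

    // Returns true if data begins with the given signature bytes.
    private static boolean startsWith(byte[] data, int... signature) {
        if (data.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if ((data[i] & 0xFF) != signature[i]) {
                return false;
            }
        }
        return true;
    }

    // Rough heuristic (an assumption, not a standard): treat the data as text
    // if it has no control bytes other than tab, LF, VT, FF, CR and ESC.
    // Empty input passes this check trivially.
    private static boolean looksLikeText(byte[] data) {
        for (byte b : data) {
            int u = b & 0xFF;
            if (u < 0x09 || (u > 0x0D && u < 0x20 && u != 0x1B)) {
                return false;
            }
        }
        return true;
    }

    // Maps the leading bytes of a file to a coarse, made-up category label.
    public static String sniff(byte[] head) {
        if (startsWith(head, '%', 'P', 'D', 'F')) return "pdf";
        if (startsWith(head, 0x89, 'P', 'N', 'G')) return "png";
        if (startsWith(head, 0xFF, 0xD8, 0xFF)) return "jpeg";
        if (startsWith(head, 'I', 'I', 0x2A, 0x00)
                || startsWith(head, 'M', 'M', 0x00, 0x2A)) return "tiff";
        // .docx files are ZIP archives, but so are .xlsx, .jar and others.
        if (startsWith(head, 'P', 'K', 0x03, 0x04)) return "zip-container";
        // Legacy .doc files use the OLE2 container, as do .xls and .ppt.
        if (startsWith(head, 0xD0, 0xCF, 0x11, 0xE0)) return "ole2-container";
        return looksLikeText(head) ? "text" : "unknown";
    }

    // Reads up to 512 bytes from the start of the file and classifies them.
    public static String sniff(Path path) throws IOException {
        byte[] buf = new byte[512];
        int n = 0;
        try (InputStream in = Files.newInputStream(path)) {
            int r;
            while (n < buf.length && (r = in.read(buf, n, buf.length - n)) != -1) {
                n += r;
            }
        }
        return sniff(Arrays.copyOf(buf, n));
    }

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            System.out.println(name + ": " + sniff(Paths.get(name)));
        }
    }
}
```

Note that the signature alone cannot tell a .docx from any other ZIP archive, or a .doc from any other OLE2 file. Telling those apart means looking inside the container, which is one reason to prefer a library such as Tika over hand-rolled checks.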

Ulf Jaehrig

You can use Apache Tika content detection.

import java.io.File;

import org.apache.tika.Tika;

public class Typedetection {

   public static void main(String[] args) throws Exception {

      // Assume example.mp3 is in your current directory
      File file = new File("example.mp3");

      // Instantiate the Tika facade class
      Tika tika = new Tika();

      // Detect the file type using the detect method
      String filetype = tika.detect(file);
      System.out.println(filetype);
   }
}

Q3. Do PDFs containing scanned images differ from PDFs containing text in any of their properties, or in any other way?

You can also extract the images and text embedded in PDFs that contain them. This is called embedded extraction. See this example:

https://svn.apache.org/repos/asf/tika/trunk/tika-example/src/main/java/org/apache/tika/example/ParsingExample.java