I have a Java servlet method that evaluates an uploaded file to determine its type and then passes the file on to ffmpeg for processing. Tika is used to detect the file type. Right now it seems that Tika, or my implementation of it, can only determine the container format rather than the streams within the container. That leads to a detection issue where the container might be an MP4 but the only stream within it is AAC audio. In my Java code I have the following:

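// detect() returns a MIME type string such as "video/3gpp"
String fileType = tika.detect(uploadedFile);
String[] mediaAndType = fileType.split("/");
media = mediaAndType[0];      // top-level type, e.g. "video"
mediatype = mediaAndType[1];  // subtype, e.g. "3gpp"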

I have a file with the .m4a extension that Tika detects as "video/3gpp" and MediaInfo detects as:

Format           : MPEG-4
Format profile   : 3GPP Media Release 4
Codec ID         : 3gp4 (isom/3gp4)

There is only one stream in the file:

Format         : AAC
Format/Info    : Advanced Audio Codec
Format profile : LC
Codec ID       : mp4a-40-2

If I can detect that the container holds only an audio stream, I can hand the file off to ffmpeg for audio processing instead of video processing (which throws an error on the file above).

I chose Tika originally because it is simple, fast, and easy to implement, so a way to get what I want with Tika would be best. I am open to other tools that can determine the number and types of streams within the container, though. I tried using the MP4Parser in Tika but it didn't return anything useful.

So is there a better way/tool to detect the format of the stream within the container that is fairly lightweight?
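
One option, sketched here under assumptions rather than offered as a definitive answer, is to ask ffprobe (shipped with ffmpeg) for the codec_type of each stream and route on the result. The class and method names below (StreamRouter, probeCodecTypes, parseCodecTypes, isAudioOnly) are hypothetical, and the sketch assumes ffprobe is on the PATH and Java 9 or later.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: names are illustrative, not part of Tika or ffmpeg.
final class StreamRouter {

    /** Runs ffprobe (assumed to be on the PATH) and returns one codec_type per stream. */
    static List<String> probeCodecTypes(Path file) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(
                "ffprobe", "-v", "error",
                "-show_entries", "stream=codec_type",
                "-of", "csv=p=0",
                file.toString())
            .redirectError(ProcessBuilder.Redirect.DISCARD) // keep stderr out of the parsed output
            .start();
        List<String> lines = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                lines.add(line);
            }
        }
        if (p.waitFor() != 0) {
            throw new IOException("ffprobe exited with an error for " + file);
        }
        return parseCodecTypes(lines);
    }

    /** Keeps the first CSV field of each non-blank line, e.g. "audio" or "video". */
    static List<String> parseCodecTypes(List<String> lines) {
        List<String> types = new ArrayList<>();
        for (String line : lines) {
            String first = line.split(",", 2)[0].trim();
            if (!first.isEmpty()) {
                types.add(first);
            }
        }
        return types;
    }

    /** True when at least one audio stream and no video stream was reported. */
    static boolean isAudioOnly(List<String> codecTypes) {
        return codecTypes.contains("audio") && !codecTypes.contains("video");
    }
}
```

With this, the servlet could call StreamRouter.isAudioOnly(StreamRouter.probeCodecTypes(path)) and pick the audio or video routine accordingly. One caveat: embedded cover art is typically reported as a video stream, so a file with artwork may need an extra check.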

Pete Helgren
  • IIRC There's an open Tika JIRA Issue for this. Tika has a dedicated container-aware detector for Ogg files which identifies the subtype based on what streams are there, but there's not yet one for the MOV / MP4 container type – Gagravarr Nov 09 '17 at 11:41
  • Thanks...Are you aware of any other detection library that might be able to achieve what I am after? Trying to keep it simple. If not, the next plan is to add some logic to my transcoding routine and see if only a single stream exists and if so, just attempt to transcode the audio... – Pete Helgren Nov 13 '17 at 13:34
  • If there's only one stream, there's a fair chance we can spot the type from one of the first few MP4/QuickTime atoms at the start of the file. Any chance you could raise an Apache Tika bug and attach a small file showing the issue, so we can take a look? – Gagravarr Nov 14 '17 at 13:29
  • Otherwise see https://stackoverflow.com/questions/5618363/is-there-a-way-to-use-ffmpeg-to-determine-the-encoding-of-a-file-before-transcod for identification examples – Gagravarr Nov 14 '17 at 13:29
  • I'd like to help out if I can. The files we receive are 40 minute lectures (both video and audio). I am not sure if I can truncate the file without affecting the contents but I'll see what I can do. Thanks for the link. We use Tika as a "routing" tool to tell us whether we have an audio file or a video to send it to different processes so using ffmpeg is an option but it's pretty "heavy". I might just pass the file to the video processing routine and then "switch" it to audio if I find a single stream. – Pete Helgren Nov 14 '17 at 21:52
  • See if you can get a 2 second recording? If not, ffmpeg isn't too bad in identify mode if the file is local – Gagravarr Nov 14 '17 at 23:04
  • Do you have a suggestion on how to grab just the first few seconds of the file, basically truncate it without altering it? I'd like to grab just a "chunk" of the video and create the bug report but everything I have looked at to shorten the file seems to re-process it...I worry that will remove whatever artifact you might need.... – Pete Helgren Nov 17 '17 at 16:22
  • It likely won't be playable after, but dd or similar should let you grab the first x few kb! You're right, anything that changes the atoms or the atom ordering will probably "break" the problem – Gagravarr Nov 17 '17 at 16:28
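
The atom idea raised in the comments can be sketched without any library: walk the MP4 boxes, descend into moov, trak and mdia, and read the handler type from each hdlr box ("soun" for audio, "vide" for video). This is a minimal sketch, not Tika's implementation. The class name Mp4HandlerScanner is hypothetical, and it assumes the relevant bytes are already in memory. Note that moov can sit after mdat at the end of the file, so the first few KB alone may not contain it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: a minimal box walker, not a full MP4 parser.
final class Mp4HandlerScanner {

    /** Returns the handler type ("soun", "vide", ...) of every track, in file order. */
    static List<String> handlerTypes(byte[] data) {
        List<String> found = new ArrayList<>();
        scan(ByteBuffer.wrap(data), 0, data.length, "", found); // ByteBuffer is big-endian by default
        return found;
    }

    private static void scan(ByteBuffer buf, long start, long end, String parent, List<String> found) {
        long pos = start;
        while (end - pos >= 8) {
            long size = buf.getInt((int) pos) & 0xFFFFFFFFL; // 32-bit unsigned box size
            String type = fourCc(buf, pos + 4);
            long header = 8;
            if (size == 1) {            // 64-bit size follows the type field
                if (end - pos < 16) return;
                size = buf.getLong((int) pos + 8);
                header = 16;
            } else if (size == 0) {     // box runs to the end of the enclosing range
                size = end - pos;
            }
            if (size < header || size > end - pos) return; // truncated or corrupt data
            long payload = pos + header;
            long boxEnd = pos + size;
            if (type.equals("moov") || type.equals("trak") || type.equals("mdia")) {
                scan(buf, payload, boxEnd, type, found);
            } else if (type.equals("hdlr") && parent.equals("mdia") && boxEnd - payload >= 12) {
                // hdlr payload: version/flags (4 bytes), pre_defined (4 bytes), handler_type (4 bytes)
                found.add(fourCc(buf, payload + 8));
            }
            pos = boxEnd;
        }
    }

    private static String fourCc(ByteBuffer buf, long offset) {
        byte[] b = new byte[4];
        for (int i = 0; i < 4; i++) {
            b[i] = buf.get((int) offset + i);
        }
        return new String(b, StandardCharsets.ISO_8859_1);
    }
}
```

A result that contains "soun" but not "vide" would suggest an audio-only file. For 40-minute uploads, reading from a seekable channel instead of a byte array would be the more practical design, since the whole file should not be loaded into memory.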

0 Answers