
I came across a link that shows how to hide a number of files inside an image file: http://lifehacker.com/282119/hide-files-inside-of-jpeg-images
There is more discussion on detection here: http://ask.metafilter.com/119943/How-to-detect-RARsEXEs-hidden-in-JPGs

What is a good way to programmatically detect whether an image file has other files hidden inside it? Should I try unzipping the file to see if other files come out of it?

I'm not bound to any particular language, but something that works well on the JVM would be great.

Update

One Approach:

Would something like this work (suggested by someone on MetaFilter)?

$ cat orig.jpg test.zip > stacked.jpg
$ file stacked.jpg 
stacked.jpg: JPEG image data, JFIF standard 1.01
$ convert stacked.jpg stripped.jpg  # this is an ImageMagick command
$ ls -l
 11483 orig.jpg
322399 stacked.jpg
 11484 stripped.jpg
310916 test.zip

I could use JMagick for this approach.

Andrea
Jayson
  • I've updated the link. You are right, the hidden files would not be in the metadata. However, the problem still stands - how can I detect that the image file contains some hidden files inside it. – Jayson Jan 22 '13 at 03:38
  • You can't by magic, you could guess how the files were hidden in a given instance. But that can vary completely from an instance to another, you could create a different hiding method for example. – mmgp Jan 22 '13 at 03:57
  • Yes, you can detect that by magic - http://en.wikipedia.org/wiki/Magic_number_(programming) – Tesseract Jan 22 '13 at 04:10
  • @SpiderPig are you referring to magic numbers that identify file formats ? I can simply remove them. – mmgp Jan 22 '13 at 04:14
  • @mmgp I've updated the question with one of the approaches I found on the internet – Jayson Jan 22 '13 at 04:15

3 Answers


Great question!

If all you want to check for is a RAR or ZIP file appended to the end of an image file, then running it through the `unrar` or `unzip` command is the easiest way to do it.
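As a rough sketch of that first check without shelling out: Python's standard `zipfile` module locates a ZIP archive from the end of the data, so it can open an archive that was appended to an image. The helper names and the fake JPEG bytes below are made up for illustration, and this covers ZIP only, since the Python standard library has no RAR reader.

```python
import io
import zipfile
from typing import List


def list_appended_zip_entries(data: bytes) -> List[str]:
    """Return the member names if `data` ends with a readable ZIP archive, else []."""
    buffer = io.BytesIO(data)
    # is_zipfile looks for the end-of-central-directory record near the end
    # of the data, so bytes in front of the archive do not hide it.
    if not zipfile.is_zipfile(buffer):
        return []
    buffer.seek(0)
    try:
        with zipfile.ZipFile(buffer) as archive:
            return archive.namelist()
    except zipfile.BadZipFile:
        return []


def build_stacked_sample() -> bytes:
    """In-memory stand-in for `cat orig.jpg test.zip > stacked.jpg` (hypothetical data)."""
    fake_jpeg = b"\xff\xd8" + b"\x00" * 64 + b"\xff\xd9"  # placeholder, not a real image
    zip_buffer = io.BytesIO()
    with zipfile.ZipFile(zip_buffer, "w") as archive:
        archive.writestr("secret.txt", "hidden payload")
    return fake_jpeg + zip_buffer.getvalue()


if __name__ == "__main__":
    print(list_appended_zip_entries(build_stacked_sample()))
```

An empty result only means no readable ZIP was found at the end of the data; it says nothing about other archive formats or other hiding methods.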

If you want a faster but less exact check, you can check for some of the special file format signatures that indicate certain types of files. The usual UNIX tool to identify file format is `file`. It uses a database of binary file signatures, whose format is defined in the `magic(5)` man page. It won’t find a RAR file for you at the end of a JPEG, because it only looks at the start of files to try to identify them quickly, but you might be able to modify its source code to do what you want. You could also reuse its database of file signatures. If you look at the archive file part of its database in the Rar files section, it shows this:

# RAR archiver (Greg Roelofs, newt@uchicago.edu)
0   string      Rar!        RAR archive data,

which indicates that if your JPEG file contains the four bytes `Rar!` that would be suspicious. But you would have to examine the Rar file format spec in detail to check whether more of the Rar file structure is present to avoid false positives. This web page also contains the four bytes `Rar!`, but there are no hidden files attached to it :P

But if someone knows the details of your automated checks, they could easily work around them. The simplest workaround would be to reverse all the bytes of the files before appending them to the JPEG. Then none of your signatures would catch the reversed version of the file.


If someone really wants to hide a file inside an image, there are all sorts of ways to do that that you won’t be able to detect easily. The general term for this is “steganography.” The Wikipedia page, for example, shows a picture of trees that has a picture of a cat hidden inside it. For simpler steganographic methods, there are statistical tests that can indicate something funny has been done to a picture, but if someone spends a lot of time to come up with their own method to hide other files inside images, you won’t be able to detect it.

andrewdotn
  • @mmgp Please stop commenting on this thread. Your rude and unhelpful comments are not appreciated by anyone here. – andrewdotn Jan 22 '13 at 04:21
  • @andrew thanks. I'm not at all planning to tackle steganography from all aspects as illustrated by that tree-cat pic. However, I'm looking for ways to find if there is a completely separate file hidden inside the image. Sure, to begin with I don't know what file format could be hidden but I can target different formats one-by-one. If I target RAR and it is actually at the end of the JPEG then what might the options be? Can I examine JPEG bit-by-bit to see if it has a RAR in it? How can I do this? – Jayson Jan 22 '13 at 04:29
  • @Jayson In the case where there’s a RAR file appended, whether it’s appended to a JPEG, a PNG, or anything else doesn’t really matter. The archive part is outside the part defined by the image file format. RAR files start with the string `Rar!`, so you could scan byte-by-byte until you hit that, and then treat the bytes from then on as a RAR file—but the `unrar` tool already does that. To do something much more complicated you’d basically have to reimplement `unrar` in Java :/ – andrewdotn Jan 22 '13 at 04:40

To see if there's any metadata or other information appended to the file, you could decode the image and re-encode it to see if the size decreases dramatically. For a JPEG file you would want to do something like a lossless rotate that retains the original DCT data, otherwise the file size might change just through encoding differences.

A smaller result wouldn't be proof of hidden data, but it would be an indicator that you need to take a closer look.
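A cheaper first pass than a full decode and re-encode is sketched below. It is a different technique from the one described above: it walks the JPEG marker structure up to the end-of-image (EOI) marker and counts the bytes left over, without touching pixel data. The function names and the synthetic sample are hypothetical, the sample is a marker stream rather than a decodable image, and a nonzero count is only a hint, because some legitimate files also carry data after EOI.

```python
def _skip_entropy_coded_data(data: bytes, pos: int) -> int:
    """Return the offset of the next real marker after scan data starting at `pos`."""
    while True:
        pos = data.find(b"\xff", pos)
        if pos == -1 or pos + 1 >= len(data):
            return len(data)
        following = data[pos + 1]
        if following == 0x00 or 0xD0 <= following <= 0xD7:
            pos += 2  # stuffed 0xFF byte or restart marker: still scan data
        elif following == 0xFF:
            pos += 1  # fill byte in front of a marker
        else:
            return pos


def jpeg_trailing_bytes(data: bytes) -> int:
    """Return how many bytes follow the first top-level EOI marker.

    Raises ValueError if the data cannot be walked as a JPEG marker stream.
    """
    if data[:2] != b"\xff\xd8":  # SOI, start of image
        raise ValueError("not a JPEG: missing SOI marker")
    pos = 2
    while pos < len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        while pos < len(data) and data[pos] == 0xFF:
            pos += 1  # skip the 0xFF prefix and any fill bytes
        if pos >= len(data):
            break
        marker = data[pos]
        pos += 1
        if marker == 0xD9:  # EOI, end of image
            return len(data) - pos
        if marker == 0x01 or 0xD0 <= marker <= 0xD7:
            continue  # TEM and RSTn markers carry no length field
        if pos + 2 > len(data):
            break
        length = int.from_bytes(data[pos:pos + 2], "big")
        if length < 2:
            raise ValueError(f"bad segment length at offset {pos}")
        pos += length  # the length field counts itself plus the payload
        if marker == 0xDA:  # SOS: entropy-coded scan data follows the header
            pos = _skip_entropy_coded_data(data, pos)
    raise ValueError("no EOI marker found")


def build_marker_stream_sample() -> bytes:
    """Synthetic SOI/APP0/SOS/EOI stream for illustration (not decodable)."""
    app0 = b"\xff\xe0" + (16).to_bytes(2, "big") + b"JFIF\x00" + b"\x00" * 9
    sos = b"\xff\xda" + (8).to_bytes(2, "big") + b"\x01\x01\x00\x00\x3f\x00"
    scan = b"\x12\xff\x00\x34\xff\xd0\x56"
    return b"\xff\xd8" + app0 + sos + scan + b"\xff\xd9"


if __name__ == "__main__":
    stacked = build_marker_stream_sample() + b"Rar!\x1a\x07\x00"
    print(jpeg_trailing_bytes(stacked))
```

Segments such as an EXIF block with an embedded thumbnail are skipped by their declared length, so an inner JPEG's EOI marker is not mistaken for the end of the outer image.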

You never shared your motivation for asking the question, but I'm going to guess that it's about uploading images to a public site. In that case you really shouldn't care whether the submitted image contains extraneous data; you should just cleanse the input regardless. The decode/re-encode process would be perfect for this.

Mark Ransom
  • I don't see how this could work, honestly. You are assuming the file can be decoded, but what if I (as the one that hid the data) removed the data necessary for the file to be decoded ? I don't have any problem handling the files, because I know how I removed them. – mmgp Jan 22 '13 at 03:59
  • @mmgp, I thought we were starting with the assumption that we had a valid image file. Obviously if you invent your own image file format you can hide anything you want. – Mark Ransom Jan 22 '13 at 04:41
  • @mmgp, I apologize, my answer was unclear and you were reacting to that. What I meant to say was to decode the image part of the data, not the unknown part. I've slightly changed the wording to make that clear. – Mark Ransom Jan 22 '13 at 04:44
  • The problem is getting the image part of the data if you don't know the actual format of the data. Even if we take the simplest image formats, like the ones by netpbm, and simply exchange the first line with the second line, the ready tools won't attempt to read it since it fails the simplest of the tests that is done to attempt to identify it. After we settle on a lot of pre-conditions, then the question might be answerable. As it stands it can't, because we can make up any hiding process, and it doesn't need to invent a new format, just scramble it a little. – mmgp Jan 22 '13 at 04:47
  • @MarkRansom yeah, the assumption is that the image part is a valid image. – Jayson Jan 22 '13 at 04:48
  • @Jayson here is one of the simplest images I can create: P2 1 1 1 1. The ready tools will identify this as a Netpbm PGM image. Now, remove the initial `P`. Everything fails. Note that the image data (i.e., its pixels intensities, width, height) is intact. – mmgp Jan 22 '13 at 04:50
  • @mmgp I understand there doesn't seem to be a foolproof way to capture everything. My final goal is to detect whether the image contains hidden media files in it. I'm just trying to take first step in that direction. – Jayson Jan 22 '13 at 04:56
  • @Jayson then I think a more appropriate question would be: "I have this file X here, which is a image in a format Y. But I suspect it has some hidden content there, because A, B, and C. Is there something I can attempt to find the hidden content in this specific situation ? We can assume it is not a method that hides the data into the image itself, i.e., it doesn't change any bits of the image data itself.". Even then, I'm not sure if it is a good question, but it is better than the current one. – mmgp Jan 22 '13 at 05:01
  • @mmgp, I didn't see anything in the question that required deciphering the hidden content. It was merely a question of determining if there *was* hidden content, on a file that is masquerading as a valid image file. Creating a file that *isn't* a valid image is beyond the scope of the question as well. Your misunderstanding of the question borders on trolling. – Mark Ransom Jan 22 '13 at 16:24
  • @MarkRansom you must be missing something then. The question started with some link that showed a person "hiding" images in a rar (or some other format that I don't remember now). So first you have to decide which format was used to group the files in a compressed format, then you have to assume that the compression format is not fooling you so you can decompress it. Now you can try to decide if there are hidden images. – mmgp Jan 22 '13 at 17:22

You could search for the file signature; see http://en.wikipedia.org/wiki/List_of_file_signatures for a list. For example, for a 7z file the signature is 37 7A BC AF 27 1C, for RAR files it's 52 61 72 21 1A 07 00, and for ZIP it's 50 4B 03 04. Take a look at a compressed file in a hex editor, e.g. HxD.
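A minimal sketch of such a search, using only the three signatures quoted above. The names `SIGNATURES` and `find_signatures` are made up for illustration. A hit is a reason to look closer rather than proof, since short byte patterns can occur by chance in compressed image data, and the 7-byte RAR value is the older form (RAR5 archives use a slightly different signature).

```python
# Signatures as listed in the answer above (hex bytes).
SIGNATURES = {
    "7z": bytes.fromhex("37 7A BC AF 27 1C"),
    "rar": bytes.fromhex("52 61 72 21 1A 07 00"),
    "zip": bytes.fromhex("50 4B 03 04"),
}


def find_signatures(data: bytes):
    """Return sorted (offset, kind) pairs for every known signature found in `data`."""
    hits = []
    for kind, magic in SIGNATURES.items():
        start = data.find(magic)
        while start != -1:
            hits.append((start, kind))
            start = data.find(magic, start + 1)
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical stand-in for an image with archive data appended.
    sample = b"\xff\xd8" + b"\x00" * 10 + b"\xff\xd9" + SIGNATURES["rar"] + b"payload"
    for offset, kind in find_signatures(sample):
        print(f"possible {kind} archive at offset {offset}")
```

To scan a real file, read it with `open(path, "rb").read()` and pass the bytes in; a hit at offset 0 means the file is itself an archive rather than an image.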

Tesseract