How you detect a self-extractor depends on whether you want to detect self-extractors within a known set of formats with 100% reliability, or to detect unfamiliar self-extractors with less reliability.
(Both approaches have their uses. The latter is good for calling in a human for a second opinion, for example.)
Option A would be to use the same approach archival tools use.
A self-extracting archive is just a regular archive concatenated onto an EXE file, with the offsets fixed up. (For Zip files, you can do the fix-up manually using zip -A from Info-ZIP.) So, open the file and scan through it, looking for valid RAR/Zip/etc. headers or trailers. (To do it efficiently, use an algorithm like Aho-Corasick to search for all candidate strings in a single pass.)
For extra reliability, parse the MZ and NE or PE header to figure out how many bytes to skip to get past any potential matching strings within the EXE itself.
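As a rough illustration of the scanning step in option A, here is a minimal Python sketch. For brevity it uses repeated bytes.find calls rather than Aho-Corasick, and it only knows a handful of well-known magic numbers. The names SIGNATURES and find_signatures are made up for this example. A magic-number hit is only a candidate, so you would still need to validate the header that follows it.

```python
# Well-known magic numbers for a few archive formats. This list is
# deliberately incomplete; extend it for the formats you care about.
SIGNATURES = {
    "zip local header": b"PK\x03\x04",
    "zip end of central directory": b"PK\x05\x06",
    "rar": b"Rar!\x1a\x07",        # common prefix of RAR4 and RAR5 markers
    "7z": b"7z\xbc\xaf\x27\x1c",
    "cab": b"MSCF",
}


def find_signatures(data, start=0):
    """Return sorted (offset, name) pairs for the first hit of each signature.

    `start` lets the caller skip the EXE portion so that strings inside
    the stub itself are not reported.
    """
    hits = []
    for name, magic in SIGNATURES.items():
        pos = data.find(magic, start)
        if pos != -1:
            hits.append((pos, name))
    return sorted(hits)
```

Passing the end of the EXE portion as start is what gives you the "extra reliability" mentioned above, since it avoids matching the signature tables that an unpacker stub carries in its own code.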
Option B would be to parse the MZ header as described by Medinoc but then, instead of looking for a specific section in the PE header, calculate the total length of the NE or PE binary and skip it all. (Win16 self-extractors do exist, as created by tools like WinZip 6.3 SR-1 and below.)
Then, apply a heuristic check, such as comparing the size of the skipped EXE portion to the size of the whole file and deciding whether a small EXE portion followed by a large amount of concatenated data looks characteristic of a self-extractor.
(Bear in mind that this might also catch DPMI-based DOS applications, since they use the same "stub plus stuff concatenated on" structure, unless you do additional checking on non-NE/non-PE files to rule them out.)
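To make option B a bit more concrete, here is a hedged Python sketch for the PE case only (NE would need its own parser). It takes the end of the EXE portion to be the end of the last section's raw data, which is a common approximation rather than an official definition. The function names pe_end_offset and overlay_ratio are invented for this example.

```python
import struct


def pe_end_offset(data):
    """Return the offset just past the last PE section's raw data.

    Returns None if `data` does not look like a PE file.
    """
    if len(data) < 0x40 or data[:2] != b"MZ":
        return None
    # e_lfanew: offset of the PE header, stored at 0x3C in the MZ header.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        return None
    coff = e_lfanew + 4
    if coff + 20 > len(data):
        return None
    (num_sections,) = struct.unpack_from("<H", data, coff + 2)
    (opt_size,) = struct.unpack_from("<H", data, coff + 16)
    table = coff + 20 + opt_size          # start of the section table
    end = table + 40 * num_sections       # headers count as EXE too
    if end > len(data):
        return None
    for i in range(num_sections):
        # SizeOfRawData and PointerToRawData sit at +16 and +20
        # inside each 40-byte section header.
        raw_size, raw_ptr = struct.unpack_from("<II", data, table + 40 * i + 16)
        if raw_ptr:
            end = max(end, raw_ptr + raw_size)
    return end


def overlay_ratio(data):
    """Fraction of the file that lies after the PE image, or None."""
    end = pe_end_offset(data)
    if end is None:
        return None
    end = min(end, len(data))
    return (len(data) - end) / len(data)
```

What ratio counts as "looks like a self-extractor" is a judgment call you would have to tune against real samples. Note also that an Authenticode signature is stored after the sections, so a signed EXE will show a small non-zero ratio under this approximation.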
The most reliable solution would be to combine both approaches:
Use option A and check for the identifying headers/trailers for all modern or historically common EXE-based self-extractor formats (7z, RAR, ACE, Zip, ARJ, ARC, LHA/LZH, Zoo, Inno Setup installers, NSIS installers, single-file InstallShield installers from the pre-.msi era, or an EXE containing a .cab or .msi bundle).
If you didn't get a match, use option A to rule out .NET EXE files, common DPMI extenders, and other common bulk content that might have been concatenated onto the EXE as a poor man's resource bundle (e.g. images, audio, or video).
To create test files for DPMI EXEs, just compile a "Hello, World!" to the DPMI target using DJGPP (Linux) and Open Watcom C/C++ (1.9, 2.0). DJGPP will get you CWSDPMI, while Open Watcom C/C++ includes the DOS/4GW, PMODE/W, DOS/32A, and CauseWay DPMI extenders and the Win386 Windows extender, and is compatible with the other free/freed extenders. (Phar Lap's extenders, which Microsoft licensed for inclusion with Microsoft C/C++, are the only notable ones I'm aware of which didn't get freed, but I believe Open Watcom can at least generate the binary that they're supposed to be prepended onto.)
You may also need to rule out executable packers, since they use a stub-based system. UPX is pretty much the only one in use today but, historically, there were a lot of them.
As a fallback, parse the MZ and LE, NE, or PE headers to properly count embedded resources (e.g. icons) as part of the EXE portion. Then, if more than some percentage of the file is "extra data", it's likely to be a self-extractor.
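As a starting point for that fallback, the following Python sketch shows two small building blocks: working out how long a plain DOS MZ image claims to be (from the e_cp and e_cblp fields), and checking which extended header, if any, the MZ header points at. The function names are invented for this example, and it assumes the extended header is reachable through the offset at 0x3C of the first MZ header; executables bound to a DOS extender may chain their headers differently, so treat an "MZ" result as "needs a closer look" rather than as proof of a plain DOS program.

```python
import struct


def dos_image_size(data):
    """Size in bytes of the DOS load image, per the MZ header, or None."""
    if len(data) < 0x1C or data[:2] != b"MZ":
        return None
    # e_cblp: bytes used in the last 512-byte page (0 means a full page).
    # e_cp:   number of 512-byte pages, including the last partial one.
    e_cblp, e_cp = struct.unpack_from("<HH", data, 2)
    if e_cp == 0:
        return None
    if e_cblp == 0:
        return e_cp * 512
    return (e_cp - 1) * 512 + e_cblp


def extended_header_kind(data):
    """Return "PE", "NE", "LE", "LX", or "MZ" (no extended header found).

    Returns None if the file does not start with an MZ header.
    """
    if len(data) < 0x40 or data[:2] != b"MZ":
        return None
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    sig = data[e_lfanew:e_lfanew + 4]
    if sig == b"PE\x00\x00":
        return "PE"
    if sig[:2] in (b"NE", b"LE", b"LX"):
        return sig[:2].decode("ascii")
    return "MZ"
```

For a file classified as "MZ", comparing dos_image_size against the file length gives the same kind of "extra data" measurement as for NE or PE files, which is where DPMI-based programs will show up as false positives unless you rule them out first.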