16

I responded to another question about developing for the iPhone in non-Objective-C languages, and I made the assertion that using, say, C# to write for the iPhone would strike an Apple reviewer wrong. I was speaking largely about UI elements differing between the ObjC and C# libraries in question, but a commenter made an interesting point, leading me to this question:

Is it possible to determine the language a program is written in, solely from its binary? If there are such methods, what are they?

Let's assume for the purposes of the question:

  • That from an interaction standpoint (console behavior, any GUI appearance, etc.) the two are identical.
  • That performance isn't a reliable indicator of language (no comparing, say, Java to C).
  • That you don't have an interpreter or something between you and the language - just raw executable binary.

Bonus points if you're as language-agnostic as possible.

Community
Tim

8 Answers

16

Short answer: YES

Long answer:

If you look at a binary, you can find the names of the libraries that have been linked in. Opening cmd.exe in TextPad easily finds the following at hex offset 0x270: msvcrt.dll, KERNEL32.dll, NTDLL.DLL, USER32.dll, etc. msvcrt.dll is the Microsoft C runtime library. KERNEL32.dll, NTDLL.DLL, and USER32.dll are OS-specific libraries that tell you either the target platform or the platform on which the program was built, depending on how well the cross-platform development environment segregates the two.
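The library-name search described above can be approximated without a hex editor. This is a minimal sketch, not a PE parser: it scans raw bytes for ASCII runs ending in `.dll`. The function name `find_library_names` and the sample bytes are made up for illustration and are not real cmd.exe contents.

```python
import re

# ASCII run of name characters ending in ".dll" (case-insensitive).
DLL_NAME = re.compile(rb"[A-Za-z0-9_\-.]{1,64}\.dll", re.IGNORECASE)


def find_library_names(data: bytes) -> list[str]:
    """Return the distinct DLL-like names found in raw bytes, in first-seen order."""
    seen = []
    for match in DLL_NAME.finditer(data):
        name = match.group().decode("ascii").lower()
        if name not in seen:
            seen.append(name)
    return seen


# Synthetic stand-in for part of an import table (assumption, not real data).
sample = b"\x00\x00msvcrt.dll\x00KERNEL32.dll\x00\x01\x02NTDLL.DLL\x00USER32.dll\x00"
print(find_library_names(sample))
```

To try it on a real file, pass `open(path, "rb").read()` instead of `sample`. A name scan like this only gives hints; it says nothing about whether the library is actually imported.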

Setting aside those clues, most any C/C++ compiler will have to insert the names of functions into the binary; a list of all functions (or entry points) is stored in a table. C++ 'mangles' the function names to encode the arguments and their types in order to support overloaded methods. It is possible to obfuscate the function names, but they would still exist. The function signatures include the number and types of the arguments, which can be used to trace into the system or internal calls used in the program. At offset 0x4190 is "SetThreadUILanguage", which can be searched for to find out a lot about the development environment. I found the entry-point table at offset 0x1ED8A. I could easily see names like printf, exit, and scanf, along with __p__fmode, __p__commode, and __initenv.
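As a rough illustration of how mangled names hint at the toolchain, the sketch below classifies a symbol by its prefix. It assumes the two common schemes: Itanium C++ ABI names begin with `_Z` (or `__Z` where the platform adds a leading underscore), and MSVC decorated names begin with `?`. The helper name `guess_mangling` is hypothetical, and other compilers can emit similar-looking names, so treat the result as a hint only.

```python
def guess_mangling(symbol: str) -> str:
    """Classify a symbol name by its mangling prefix (heuristic only)."""
    if symbol.startswith("_Z") or symbol.startswith("__Z"):
        return "Itanium C++ ABI (GCC, Clang)"
    if symbol.startswith("?"):
        return "MSVC C++"
    return 'unmangled (C, assembly, or extern "C")'


# Example symbols: foo(int) in both schemes, plus a plain C name.
for name in ["_Z3fooi", "?foo@@YAHH@Z", "printf"]:
    print(name, "->", guess_mangling(name))
```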

Any executable for the x86 processor will have a data segment containing any static text that was included in the program. Back in cmd.exe, at offset 0x42C8, is the text "S.o.f.t.w.a.r.e..P.o.l.i.c.i.e.s..M.i.c.r.o.s.o.f.t..W.i.n.d.o.w.s..S.y.s.t.e.m.". The string takes twice as many bytes as would otherwise be necessary because it is stored using two-byte wide characters, probably for internationalization. Error codes and messages are a prime source here.
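The dotted text above is what two-byte characters look like in a plain text editor: each printable byte is followed by a zero byte. A minimal sketch for pulling such strings out, assuming they are UTF-16LE and limited to printable ASCII (the function name `wide_strings` and the sample bytes are illustrative):

```python
import re


def wide_strings(data: bytes, min_chars: int = 4) -> list[str]:
    """Find runs of printable ASCII characters stored as UTF-16LE."""
    # Each character is one printable byte followed by a zero byte.
    pattern = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_chars)
    return [m.group().decode("utf-16-le") for m in pattern.finditer(data)]


# Synthetic bytes: junk, a wide string, junk, and a run too short to report.
sample = (
    b"\xff\xfe\x01"
    + "Software\\Policies".encode("utf-16-le")
    + b"\x00\x00\x90"
    + "ab".encode("utf-16-le")
)
print(wide_strings(sample))
```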

At offset 0xB1B0 is "p.u.s.h.d" followed by mkdir, rmdir, chdir, md, rd, and cd; I left out the unprintable characters for readability. Those are all built-in commands of cmd.exe.

For other programs, I've sometimes been able to find the path from which a program was compiled.

So, yes, it is possible to determine the source language from the binary.

Kelly S. French
  • 2
    This all relies on people linking libraries. What happens if that's done statically, or the functions are copy/pasted into the source? It's a great tip (+1 from me), but it's not always reliable. – Tim Nov 10 '09 at 00:44
  • Entry points exist even when an executable is statically linked. They are based on functions defined regardless of which object module they came from or how they were linked. Functions loaded at runtime don't have their names in the entrypoint table but would have to be mentioned in the data segment somewhere because it's needed by the run-time loader. You are right about the copy/pasted source to a degree. The only way around this would be if all the code were in main and no libraries were linked in. – Kelly S. French Nov 10 '09 at 02:04
10

I'm not a compiler hacker (someday, I hope), but I figure that you may be able to find telltale signs in a binary file that would indicate what compiler generated it and some of the compiler options used, such as the level of optimization specified.

Strictly speaking, however, what you're asking is impossible. It could be that somebody sat down with a pen and paper and worked out the binary codes corresponding to the program that they wanted to write, and then typed that stuff out in a hex editor. Basically, they'd be programming in assembly without the assembler tool. Similarly, you may never be able to tell with certainty whether a native binary was written in straight assembler or in C with inline assembly.

As for virtual machine environments such as JVM and .NET, you should be able to identify the VM by the byte codes in the binary executable, I would expect. However you may not be able to tell what the source language was, such as C# versus Visual Basic, unless there are particular compiler quirks that tip you off.
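One way to make the "identify the VM by its bytes" idea concrete is to check the file's leading magic number. This is a minimal sketch under stated limits: it only looks at the first few bytes, the function name `sniff_container` is made up, and it cannot tell a .NET assembly from a native Windows executable, since both are PE files and separating them would need a look at the PE headers for a CLI header.

```python
def sniff_container(data: bytes) -> str:
    """Guess the container format from the leading magic bytes."""
    if data.startswith(b"\xca\xfe\xba\xbe"):
        # The same four bytes open Java class files and Mach-O universal binaries.
        return "Java class file or Mach-O universal binary"
    if data.startswith(b"MZ"):
        return "DOS/PE executable (native Windows or .NET)"
    if data.startswith(b"\x7fELF"):
        return "ELF executable"
    return "unknown"


print(sniff_container(b"\xca\xfe\xba\xbe\x00\x00\x00\x34"))
print(sniff_container(b"MZ\x90\x00"))
```

Even when the VM is identified this way, the source language (C# versus Visual Basic, Java versus Kotlin) is a separate question, as the paragraph above notes.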

Parappa
  • 1
    It is possible. See http://stackoverflow.com/questions/1704202/determine-source-language-from-a-binary/1704449#1704449 – Kelly S. French Nov 10 '09 at 00:43
  • 2
    It seems to me that in theory it is not possible, and in practice it is. :) – Parappa Nov 10 '09 at 22:10
  • If it was 100% assembly, you could tell that from examining the binary. Theoretically, someone could write a program in FORTRAN and then run it through a Fortran-to-C app to obtain 'C' source code. When that gets compiled, it is possible that there would be no traces to indicate the original language was not 'C'. That begs the question of what exactly qualifies as the "language it was written in". Maybe the question could be more specific this way: "Can you tell what language was used to create this binary?" In other words, the language that was translated into binary. – Kelly S. French Mar 09 '11 at 16:40
3

What about these tools:

PE Detective

PEiD

Both are PE identifiers. OK, they're both for Windows, but that's what it was when I landed here.

Christian Casutt
1

I expect you could if you disassemble the binary, or at least you may be able to identify the compiler: not all compilers use the same code for printf, for example, so Objective-C and GNU C should differ here.

You have excluded all byte-code languages so this issue is going to be less common than expected.

James Black
1

First, run `what` on some binaries and look at the output. CVS (and SVN) identifiers are scattered throughout the binary image. And most of those are from libraries.
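Assuming "what" here means the SCCS `what` utility, it reports text that follows an `@(#)` marker, while RCS/CVS-style keywords look like `$Id: ... $`. The sketch below searches raw bytes for both patterns. The function name `find_ident_strings` and the sample data are invented for illustration; the real tools have their own rules for where a string ends.

```python
import re

# Text after an SCCS "@(#)" marker, up to a quote, NUL, '>', newline, or backslash.
SCCS_MARK = re.compile(rb'@\(#\)([^"\x00>\n\\]*)')
# RCS/CVS-style expanded keyword such as "$Id: file.c,v 1.4 ... $".
RCS_KEYWORD = re.compile(rb"\$[A-Za-z]+: [^$\n\x00]* \$")


def find_ident_strings(data: bytes) -> list[str]:
    """Collect version-control identifier strings embedded in raw bytes."""
    hits = [m.group(1).decode("latin-1").strip() for m in SCCS_MARK.finditer(data)]
    hits += [m.group().decode("latin-1") for m in RCS_KEYWORD.finditer(data)]
    return hits


# Synthetic sample (assumption): one SCCS-style and one RCS-style identifier.
sample = (
    b"\x00@(#)libfoo 1.2 (hypothetical)\x00junk"
    b"$Id: foo.c,v 1.4 2009/11/09 author Exp $\x00"
)
for hit in find_ident_strings(sample):
    print(hit)
```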

Also, there's often a "map" to the various library functions. That's a big hint, also.

When the libraries are linked into the executable, there is often a map that's included in the binary file with names and offsets. It's part of creating "position independent code". You can't simply "hard-link" the various object files together. You need a map and you have to do some lookups when loading the binary into memory.

Finally, the start-up module for C, C++ (and I imagine C#) is unique to that compiler's default set of libraries.

S.Lott
  • What if you statically link all the libraries that you can? – James Black Nov 09 '09 at 22:13
  • @James Black: Doesn't change a thing. The .o's are just concatenated into the executable along with some instructions to the loader for how to populate the material in memory. – S.Lott Nov 09 '09 at 22:15
0

No, the bytecode is language agnostic. Different compilers could even take the same source code and generate different binaries. That's why you don't see general-purpose decompilers that work on binaries.

David
0

The command 'strings' could be used to get some hints as to what language was used (for instance, I just ran it on the stripped binary for a C application I wrote and the first entries it finds are the libraries linked by the executable).
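For readers without `strings` at hand, here is a minimal sketch of roughly what it does: report runs of printable ASCII of at least four characters (four is the tool's usual default minimum). The function name `ascii_strings` and the sample bytes are illustrative, and the real tool has more options and slightly different rules for what counts as printable.

```python
import re


def ascii_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of printable ASCII at least min_len bytes long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


# Synthetic bytes loosely resembling the start of a dynamically linked ELF file.
sample = b"\x7fELF\x02\x01\x01\x00libc.so.6\x00puts\x00ab\x00__libc_start_main\x00"
print(ascii_strings(sample))
```

Library names such as `libc.so.6` showing up first matches the observation above: they suggest a C runtime, though not necessarily that the program itself was written in C.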

Jason Machacek
-1

Well, C is initially converted to ASM, so you could write all C code in ASM.

alternative
  • Okay, it's true that not all C compilers necessarily work this way, but you _can_ generate asm code with `gcc` with `-S` so I don't think this warrants a downvote. – alternative Apr 22 '12 at 13:13
  • 1
    This is a comment, not an answer. But upvoted back to zero as it's a good comment. – Todd Main May 13 '15 at 02:20
  • @Todd Main Heavily disagree. My answer is "no" as I've provided a counterexample. In a general sense the answer is "usually you can tell" but in a strict sense you only need a single counterexample for the answer to be that it is impossible. – alternative May 23 '15 at 17:41
  • @alternative but to say that a single counterexample shows that it is impossible is not entirely helpful unless the question was, "can you always detect the source language from the binary?" A somewhat more accurate answer is, 'usually but not always; there are cases where it might be impossible'. To say that it is impossible implies that it is never possible, which is not true. Sorry, sometimes I get stuck on word choices. I don't really disagree absolutely, just in degree. If you can't find evidence of the source, you can say it must be 100% assembly or that the author is obfuscating on purpose. – Kelly S. French Oct 19 '15 at 14:29