3

I'm dealing with a large project that gets compiled as a shared object. Compiling it with DWARF-2 symbols (-g -feliminate-unused-debug-types) results in the .debug_info being around 700M.

If I add -feliminate-dwarf2-dups, the linker dies:

    error adding symbols: Memory exhausted
    ld returned 1 exit status

This is on a system with 4G RAM. Since this needs to be compiled on a wide range of systems, consuming over 4G RAM is not acceptable. I tried passing --no-keep-memory to ld, but it still fails.

From the ld man page, on --no-keep-memory:

  ld normally optimizes for speed over memory usage by caching the symbol tables of input files in memory. This option tells ld to instead optimize for memory usage, by rereading the symbol tables as necessary. This may be required if ld runs out of memory space while linking a large executable.
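For reference, the failing configuration is driven through gcc along these lines (file names and the -fPIC are illustrative, not the project's actual build):

    # compile with the DWARF flags from above
    gcc -g -feliminate-unused-debug-types -feliminate-dwarf2-dups -fPIC -c foo.c
    # link the shared object, forwarding --no-keep-memory to ld
    gcc -shared -Wl,--no-keep-memory -o libbig.so *.o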

I'm guessing ld loads all the symbols in memory then goes about finding dupes, which takes 5+ times the memory it takes to store them on disk.

Is there a simple way to do this incrementally? Something like:

  1. Load the symbols from first .o file
  2. Load the symbols from next .o file
  3. Merge them, removing any duplicates
  4. goto 2.

I could link files two by two into temporary archives, then link those two by two, and so on, but I really don't want to change the build process for this project. Maybe I could use objcopy to pull out the debug sections, perform the duplicate elimination separately, then put the debug sections back into the final ELF?
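For illustration, a rough sketch of that objcopy idea, using the standard separate-debug-file mechanism instead of re-inserting the sections (file names are placeholders, and the deduplication step in the middle still needs some external tool):

    # pull the debug info out into a separate file
    objcopy --only-keep-debug libbig.so libbig.debug
    # drop the debug sections from the shared object itself
    objcopy --strip-debug libbig.so
    # ... deduplicate/compress libbig.debug with an external tool here ...
    # record a pointer to the debug file so gdb can find it
    objcopy --add-gnu-debuglink=libbig.debug libbig.so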

Is there any other tool that can perform these DWARF merges? dwarfdump only reads the files. Alternatively, can I invoke gcc/ld to just do this, instead of actually linking the files?

mtijanic
    Could you break it down into smaller projects that are linked together? That way you could chop the problem into smaller chunks. – OMGtechy Apr 13 '16 at 14:51
  • @OMGtechy That would be a huuuge project on its own. Anything that requires more than just adding CFLAGS or running a small tool in a few places in the make process would end up being over a man month of work. – mtijanic Apr 13 '16 at 14:55
  • You could take a look at http://stackoverflow.com/questions/1413171/what-is-strip-gcc-application-used-for and see if you can either a) use that to strip the duplicate symbols or b) reduce the size of the problem by removing some you know you don't need – OMGtechy Apr 13 '16 at 14:58
  • What architectures are the build systems? All 32-bit x86? If on Unix or Linux, what is the limit on the size of the process data segment (ulimit -d)? – Mark Plotnick Apr 13 '16 at 15:01
  • 4G RAM is too low for a build machine; try it on a more powerful build machine to learn whether it's just the linker needing more memory or the whole build process idea not really working (similar to a smoke test or a feasibility test). – Shark Apr 13 '16 at 15:04
  • @OMGtechy I guess I could write a script that reads the symbols of two files using `dwarfdump` or something, does a diff, and then removes duplicates from one of them using `strip`. That's `n^2` calls. Sounds nasty but may work. – mtijanic Apr 13 '16 at 15:06
  • @MarkPlotnick This is amd64 Linux. – mtijanic Apr 13 '16 at 15:08
  • @Shark I also tried it on our build farm, and it crapped out there. Not sure what HW configuration those machines are, but "get more ram" is not an option. I'll try on a beefy machine to see if any amount is enough. – mtijanic Apr 13 '16 at 15:08
  • I said it for a reason: if having 8/16GB of RAM doesn't get the job done and the linker can't do it, perhaps it's not the linker's fault. Perhaps the build process idea just doesn't work. (That's the worse case in this situation; I'm hoping that some more RAM lets it finish its work, so you simply know the minimum specs required for a complete build/run scenario.) – Shark Apr 13 '16 at 15:12
  • @Shark just FYI, I tried it on a 16G machine and it ate around 7G and finished in some 30 seconds, so nothing wrong with the process, just eats a lot. – mtijanic Apr 15 '16 at 07:41
  • That's still good news though. – Shark Apr 15 '16 at 07:46

3 Answers

3

There are two ways that I know of to reduce DWARF size. Which one you want depends on your intended purpose.

Fedora (and maybe other distros, I don't know) uses the dwz tool to compress DWARF. This works after the fact: you link your program or shared library, then run dwz. It is a "semantic" compressor, meaning it understands the DWARF and rewrites it into a smaller form. In DWARF terms it makes partial CUs and shares data that way. dwz also has a mode where it can compress data across different executables for more sharing.
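As a rough illustration of how dwz is typically invoked (file names here are placeholders):

    # compress the DWARF in a single shared object, in place
    dwz libbig.so
    # multi-file mode: factor DWARF shared by several binaries into one common file
    dwz -m libs-common.dwz libbig.so libother.so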

dwz yields the best compression. The major downside is that it doesn't fit very well into a developer workflow -- it is a bit slow, uses a lot of memory, etc. It's great for distros though, and I think it would be appropriate for some other deployment situations.

The other decent way to compress debuginfo is to use the -fdebug-types-section flag to gcc. This changes the DWARF output to put large types into their own sections. The types are hashed by their contents; then the linker merges these sections automatically.

This approach yields decent compression, because types are a substantial part of the DWARF, and decent performance, because merging identical sections in the linker is cheap. The major downside is that the compression isn't as good as dwz's.
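A minimal example of the flag in use (file name is a placeholder; the type sections come from DWARF 4):

    # emit large types into their own mergeable sections; the linker dedups them
    gcc -g -gdwarf-4 -fdebug-types-section -c foo.c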

gdb understands both of these kinds of compression. Support in other tools is a bit more spotty.

Tom Tromey
  • Thanks. I've been running `dwz` for the past hour with no end in sight, so that's not a solution. Would love to see what it produces as an end result, so I know what to target. I'll try `-fdebug-types-section` next. Apparently, it is for DWARF-4, but maybe it'll work. – mtijanic Apr 13 '16 at 16:59
  • There is also compressed debug sections support in binutils 2.26, if you can use that. – dbrank0 Apr 14 '16 at 09:25
  • FYI: `dwz` and `-fdebug-types-section` both work nicely, when the build machine is strong enough. But on the variety of hardware I need to support, they take too long or just crash. I may add a check to see if your HW is good enough, and run them if so. – mtijanic Apr 15 '16 at 07:46
  • I'm at least in certain cases using `dwz` and `-fdebug-types-section`, so accepting this answer. For future reference, also useful is `objcopy --compress-debug-sections` (removes some 300M). Thanks. – mtijanic Apr 19 '16 at 11:41
2

In addition to Tom Tromey's suggestions, another thing you could do is build a cross-compiler for, e.g., an x86_64 build system, targeting all of your wide variety of systems.

This completely eliminates the 4GB limit on linker memory, and may be faster (your wide variety of small systems probably have wimpy CPUs, so building natively on them may take hours, whereas cross-compiling for them on a brawny development machine will take only minutes).

Or you could use -gsplit-dwarf -- there is no need to link the debug bits into your final shared library; you can keep them separate.
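A minimal sketch of the split-DWARF route (file name is a placeholder): most of the debug info lands in a .dwo file that never passes through the linker.

    # foo.o gets only small skeleton debug sections; the bulk goes to foo.dwo
    gcc -g -gsplit-dwarf -fPIC -c foo.c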

Update:

We already do use crosscompilers. The problem is that not all x86_64 systems we use for build machines are powerful enough to optimize this.

In that case, you don't have a hard 4GB requirement, and your claim that this is unacceptable is bogus. You could get 64GB of RAM for about USD500, and speed everything up.

developers may want to run the build on their primary system

If your developers have a primary system with less than 16GB, you are wasting their time.

Regarding -gsplit-dwarf, in cases where it is built remotely I'd still need to download the symbols as well, so there is little to gain

The gain is that you can build your binary.

Employed Russian
  • We already do use crosscompilers. The problem is that not all `x86_64` systems we use for build machines are powerful enough to optimize this. Plus, developers may want to run the build on their primary system while also doing something else. If the whole build took 2G of RAM until now, and suddenly starts needing 8, I'll break everyone's workflow. Regarding `-gsplit-dwarf`, in cases where it is built remotely I'd still need to download the symbols as well, so there is little to gain. – mtijanic Apr 15 '16 at 07:50
  • "you don't have a hard 4GB requirement" we have hundreds of machines that'd need updating, which is pretty ambitious (and expensive). Not something I can just decide. "If your developers have a primary system with less than 16GB, you are wasting their time." They are also running their browsers, editors, office tools and whatnot. "The gain is that you can build your binary." I still can, it's just huge. I can't *optimize* it for size. Either way, the answer will be to use switches for every optimization (remove dups, compress, split) and let each machine's owner decide. Thanks! – mtijanic Apr 19 '16 at 11:36
1

Well, another thing to try is the gold linker; for building large apps it might be faster/better. See: https://en.wikipedia.org/wiki/Gold_(linker)

Also, the gold linker together with split debug info worked quite well for me.

"Fission is implemented in GCC 4.7, and requires support from recent versions of objcopy and the gold linker.

Use the -gsplit-dwarf option to enable the generation of split DWARF at compile time. This option must be used in conjunction with -c; Fission cannot be used when compiling and linking in the same step.

Use the gold linker's --gdb-index option (-Wl,--gdb-index when linking with gcc or g++) at link time to create the .gdb_index section that allows GDB to locate and read the .dwo files as it needs them."
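For illustration, one way to drive that combination through gcc (file names are placeholders; here gold is selected via -fuse-ld=gold, though a symlinked ld works too):

    # compile with split DWARF -- separate compile and link steps are required
    gcc -g -gsplit-dwarf -fPIC -c foo.c bar.c
    # link with gold and have it build the .gdb_index section
    gcc -fuse-ld=gold -Wl,--gdb-index -shared -o libbig.so foo.o bar.o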

Severin Pappadeux
  • I'll try `gold` next. It could be just what I need. We didn't use it earlier because the same code is also sometimes linked into a kernel module, and `gold` can't do that, but I see no problems with making an exception here. The kernel module infrastructure already has ways to deal with 700M symbols. – mtijanic Apr 15 '16 at 07:52
  • @mtijanic And it was somewhat faster for me, apparently making the data layout a bit more cache friendly http://stackoverflow.com/questions/30010588/is-binary-linked-with-gold-linker-running-faster – Severin Pappadeux Apr 15 '16 at 14:23
  • @mtijanic meaning not only linker is faster, but app made by gold linker is a bit faster as well, sorry for confusion – Severin Pappadeux Apr 15 '16 at 16:53
  • FYI: `gold` also chokes on it on the lowest-end systems (4+4G). The `--no-keep-memory` option is ignored by `gold`, so it may even be worse. – mtijanic Apr 19 '16 at 11:37