The main downside of 64-bit mode is that pointers double in size. Alignment rules can also make classes/structs bigger. Maybe your working set just barely fit in cache in 32-bit mode, but doesn't in 64-bit mode. This is especially likely if your code uses a lot of pointers.
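To see the size effect for yourself, here is a minimal sketch. The `struct node` type and the `print_node_layout` helper are hypothetical, made up for illustration, and the 64-byte cache line is an assumption that holds on most current x86 CPUs but not everywhere.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical doubly-linked list node, used only for illustration. */
struct node {
    struct node *next;   /* 4 bytes on typical 32-bit targets, 8 on 64-bit */
    struct node *prev;
    int value;           /* may be followed by padding to keep pointers aligned */
};

/* Print how the target's pointer width affects the size of a node. */
void print_node_layout(void)
{
    printf("sizeof(void *)      = %zu\n", sizeof(void *));
    printf("sizeof(struct node) = %zu\n", sizeof(struct node));
    /* 64 is an assumed cache-line size, not something queried from the CPU. */
    printf("whole nodes per 64-byte line: %zu\n", 64 / sizeof(struct node));
}
```

Call `print_node_layout()` from `main` and build it twice, e.g. with `gcc -m32` and `gcc -m64` if you have both sets of libraries installed. On common x86 ABIs the node typically grows from 12 to 24 bytes, so only about half as many fit in the same amount of cache.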
Another possibility is that you call some external library whose 32-bit version has some asm speedups that the 64-bit version doesn't.
Use a profiler to see what's actually slow in your 64-bit version. On Windows, Intel's VTune might be a good choice. It can show you where your code is taking a lot of cache misses. Comparing total cache misses between the 32-bit and 64-bit builds should shed some light.
Re: -O1 vs. -O2: Different compilers have different meanings for options. gcc and clang have:
-Os: optimize for code size
-O0: minimal / no optimization (most variables get stored to and reloaded from memory after every statement)
-O1: some optimization without taking a lot of extra compile time
-O2: more optimizations
-O3: even more optimizations, including more aggressive auto-vectorizing (clang, and gcc since version 12, already do some auto-vectorization at -O2)
Clang doesn't seem to document its optimization options, so I assume it mirrors gcc. (There are options to report on the optimizations it performed, and to use profile-guided optimization.) See the latest version of the gcc manual (online) for more descriptions of optimization options, e.g.:
-Ofast: -O3 -ffast-math (and maybe other "unsafe" optimizations).
-Og: optimize without breaking debugging. Recommended for the edit/compile/debug cycle.
-funroll-loops: can help in some tight loops, but isn't enabled even at -O3. Don't use it for everything, because larger code size can lead to I-cache misses, which hurt more. -fprofile-use does enable it, so ideally just use PGO.
-fblah-blah: there are a ton more specific options. Usually just use -O3 to pick the recommended set.
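As a small test case for comparing these levels, here is a sketch of a floating-point reduction. The function name `sum_floats` is hypothetical. Under strict floating-point rules the compiler generally can't vectorize this loop, because that would change the order of the additions; with -ffast-math (and so -Ofast) it is allowed to reorder them.

```c
#include <stddef.h>

/* Sum n floats in order. Float addition is not associative, so without
 * -ffast-math the compiler has to keep this exact order of additions,
 * which normally prevents it from vectorizing the reduction. */
float sum_floats(const float *data, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; i++) {
        total += data[i];
    }
    return total;
}
```

Compile it with `-S` at `-O0`, `-O2`, `-O3` and `-Ofast` and compare the asm for the loop. To have the compiler tell you what it vectorized, gcc has `-fopt-info-vec` and clang has `-Rpass=loop-vectorize`. Keep in mind that with fast-math the result can differ slightly from the strict version on real data, since the rounding happens in a different order.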