I think there's a fundamental flaw in your reasoning: you're assuming that, because the function takes 68% of execution time in the optimized build versus just 9% in the unoptimized one, the unoptimized version performs better.
Keep in mind that gprof percentages are relative to the program's total run time. I'm quite sure the -O3 version actually performs better in absolute terms, but the optimizer did a much better job on the other functions, so, in proportion to the rest of the optimized code, the given subroutine looks slower, while in absolute time it is at least as fast as the unoptimized version.
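To see why, try some made-up numbers (the totals here are hypothetical; plug in your own measurements):

    -O0 build: total run time 100 s ->  9% = 9.0 s spent in the function
    -O3 build: total run time  10 s -> 68% = 6.8 s spent in the function

So even though its share of the profile grew, the function itself got faster.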
Still, to check the differences in the emitted code directly, you can use the -S switch. Also, to test whether my idea is correct, you can roughly compute the CPU time taken by the function in -O0 vs -O3 by multiplying each gprof percentage by the user time that a command like time reports for your program.
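For the -S comparison, something like this, assuming gcc and a source file named foo.c (both placeholders):

    # emit assembly instead of an object file, once per optimization level
    gcc -O0 -S foo.c -o foo_O0.s
    gcc -O3 -S foo.c -o foo_O3.s
    # compare what the optimizer changed
    diff foo_O0.s foo_O3.s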
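And for the rough absolute-time estimate (same hypothetical names):

    # build both versions and take the "user" figure from time's output
    gcc -O0 foo.c -o foo_O0 && time ./foo_O0
    gcc -O3 foo.c -o foo_O3 && time ./foo_O3

Then multiply each user time by the percentage gprof reported for that build.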
(Also, I'm quite sure you can get a measure of the absolute time spent in a subroutine from gprof itself; IIRC it's even in the default flat-profile output, in the "self seconds" column.)
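A minimal sketch of that workflow, again with placeholder names:

    # build with profiling instrumentation; running the program writes gmon.out
    gcc -O3 -pg foo.c -o foo
    ./foo
    # the flat profile's "self seconds" column is the absolute time per function
    gprof ./foo gmon.out

Note that -O3 may inline the subroutine away, in which case it won't show up as a separate entry in the profile at all.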