
I ran my app twice (in the VS IDE). The first time it took 33 seconds. Then I uncommented obj.save, which calls a lot of code, and it took 87 seconds. That's some slow serialization code! I suspect two problems. The first is that I do the following:

template<class T> void Save_IntX(ostream& o, T v) { o.write(reinterpret_cast<const char*>(&v), sizeof(T)); }

I call this template hundreds of thousands of times (well, maybe not quite that many). Does each .write() take a lock that may be slowing it down? Maybe I could use a memory stream that doesn't require a lock and dump that to the file in one go instead. Which ostream could I use that doesn't lock, perhaps because it assumes it's only used from a single thread?
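Something along these lines is what I have in mind (an untested sketch: everything goes into a std::ostringstream in memory first, and the file is written once at the end):

    // Untested sketch: buffer every write in memory, then hit the file once.
    #include <fstream>
    #include <sstream>
    #include <string>

    template<class T> void Save_IntX(std::ostream& o, T v) { o.write(reinterpret_cast<const char*>(&v), sizeof(T)); }

    void SaveAll(const char* path)
    {
        std::ostringstream buf;            // in-memory stream, no OS call per write
        Save_IntX(buf, 42);                // ...hundreds of thousands of these...
        Save_IntX(buf, 3.14);

        std::ofstream file(path, std::ios::binary);
        const std::string& s = buf.str();
        file.write(s.data(), static_cast<std::streamsize>(s.size()));
    }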

The other suspected problem is that I use dynamic_cast a lot, but I'm not sure whether I can work around that.
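One workaround that comes to mind is to push the save code into a virtual function so the vtable does the dispatch instead of dynamic_cast. A rough sketch of the idea (not my actual classes), though I'm not sure it covers every place I cast:

    // Rough sketch: a virtual Save() replaces "dynamic_cast to each concrete
    // type, then serialize" with a single vtable call.
    #include <ostream>

    struct Object
    {
        virtual ~Object() {}
        virtual void Save(std::ostream& o) const = 0;
    };

    struct Circle : Object
    {
        double radius;
        virtual void Save(std::ostream& o) const
        {
            o.write(reinterpret_cast<const char*>(&radius), sizeof(radius));
        }
    };

    void SaveObject(const Object& obj, std::ostream& o)
    {
        obj.Save(o); // no dynamic_cast<Circle*> / dynamic_cast<Square*> chain
    }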

Here is a quick profiling session after converting it to use fopen instead of ostream. I wonder why I don't see the majority of my functions in this list, but as you can see, write is still taking the longest. Note: I just realized my output file is half a gigabyte. Oops. Maybe that is why.

[profiler screenshot]
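The conversion was roughly along these lines (a simplified sketch, not my exact code; the 1 MB setvbuf buffer is just an example size):

    // Simplified sketch of the fopen/fwrite version; the 1 MB setvbuf buffer
    // size is only an example, not necessarily what the real code uses.
    #include <cstdio>

    template<class T> void Save_IntX(std::FILE* f, T v) { std::fwrite(&v, sizeof(T), 1, f); }

    void SaveAll(const char* path)
    {
        std::FILE* f = std::fopen(path, "wb");
        if (!f) return;
        static char buffer[1 << 20];
        std::setvbuf(f, buffer, _IOFBF, sizeof(buffer)); // bigger stdio buffer, fewer OS writes
        Save_IntX(f, 42);
        Save_IntX(f, 3.14);
        std::fclose(f);
    }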

    Have you tried using the C I/O library? It can be far faster than the C++ I/O library. Since you're really just copying bytes to the output stream (and not doing any formatting), it should be straightforward to convert the code and test it. – James McNellis Jun 02 '11 at 05:10
  • @James: I tried it out. It brought it down to 68 seconds, but that's still rather long. I'll profile again to see why it's still slow. – Jun 02 '11 at 05:38
  • I'll assume fwrite/ntdll/dynamic_cast is 93% of my time (although 33/68 is 48%, not 7%, which is strange). I'll assume most of the performance problem is the file bottleneck and I shouldn't really worry. I may consider reverting back to ostream, searching for a single-threaded memory stream, and trying that out. – Jun 02 '11 at 06:01
  • Poking @James. I'm pretty confused about the profile chart, but I guess there's not much more I can do? – Jun 02 '11 at 06:02
  • This is an optimized build, right? (I'm guessing yes, given the large number of "unknown frames") Are you using instrumented or sampling profiling? However, if your output is half a gigabyte, then yes, I'd expect it to take time to write that much data to disk. – James McNellis Jun 02 '11 at 06:03
  • @James: I tried instrumented as well as sampling, and then I removed the debug padding. The output is now under 120 MB (there was a lot of padding, fortunately all in the loading and saving functions). The timing was a lot quicker (less than 25 seconds), so I feel this is fast enough again. – Jun 02 '11 at 08:14
  • You're floundering. Why don't you get some [stackshots](http://stackoverflow.com/questions/375913/what-can-i-use-to-profile-c-code-in-linux/378024#378024)? They will point you straight at the problem. – Mike Dunlavey Jun 02 '11 at 10:30
  • @Mike: That kind of doesn't work when you pause and see nothing but ntdll.dll!SomeAddr in the call stack, with none of your code at all. It makes you wonder what is calling it, but the stack doesn't give you that information at all. The problem is already solved, though. – Jun 02 '11 at 13:47
  • Was there none of your code anywhere on the stack? In the thread of interest? Anyway, glad it's solved. – Mike Dunlavey Jun 02 '11 at 16:04
  • @Mike: I'm not sure what happened, but it's a single-threaded app, and instead of showing a stack with my code somewhere in it, it showed only NTDLL and nothing else. It was as if it was only considering that module/DLL. Anyway, after removing the 32 bytes of padding between objects, my file size went way down and performance went way up. I'm considering zlib'ing it, but I don't know whether it will bring more performance or not. – Jun 02 '11 at 18:38
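For reference, the zlib idea from the last comment would look roughly like this (a hypothetical sketch using zlib's one-shot compress(); whether it actually helps depends on whether the disk or the CPU is the bottleneck):

    #include <vector>
    #include <zlib.h>

    // Hypothetical sketch: compress an already-serialized memory buffer with
    // zlib's one-shot compress() before writing it to disk.
    std::vector<unsigned char> CompressBuffer(const std::vector<unsigned char>& raw)
    {
        uLongf destLen = compressBound(static_cast<uLong>(raw.size()));
        std::vector<unsigned char> out(destLen);
        // assumes raw is non-empty
        int rc = compress(&out[0], &destLen, &raw[0], static_cast<uLong>(raw.size()));
        if (rc == Z_OK)
            out.resize(destLen);   // shrink to the actual compressed size
        else
            out.clear();           // compression failed; caller can fall back to raw
        return out;
    }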

1 Answer


I'm glad you got it figured out, but the next time you do profiling, you might want to consider a few points:

  • The VS profiler in sampling mode does not sample during I/O or any other time your program is blocked, so it's only really useful for CPU-bound profiling. For example, if it says a routine has 80% inclusive time, but the app is actually computing only 10% of the time, that 80% is really only 8%. Because of this, for any non-CPU-bound work, you need to use the profiler's instrumentation mode.

  • Assuming you did that, of all those columns of data, the one that matters is "Inclusive %", because that is the routine's true cost, in the sense that if it could be avoided, that is how much the overall time would be reduced.

  • Of all those rows of data, the ones likely to matter are the ones containing your routines, because your routines are the only ones you can do anything about. It looks like the "Unknown Frames" might be your code, compiled without debugging info. In general, it's a good idea to profile with debugging info, make it fast, and then remove the debugging info.

Mike Dunlavey
  • By debugging info, I assume you mean my own debug info and not compiling in debug mode. – Jun 03 '11 at 14:12
  • @acidzombie24: I mean the profiler has to know the names of routines and (preferably) line numbers, just as a debugger would. – Mike Dunlavey Jun 03 '11 at 14:40