
I am working on a large Windows desktop application that stores a large amount of data in the form of a project file. We have a custom ORM and serialization layer to efficiently load object data from CSV format. This task is performed by multiple threads processing multiple files in parallel. A large project can contain a million or more objects, with many relationships between them.

Recently I was tasked with improving project open performance, which had deteriorated for very large projects. Profiling showed that most of the time spent can be attributed to garbage collection (GC).

My theory is that due to the large number of very fast allocations, the GC is starved: it is postponed for a very long time, and when it finally kicks in it takes a very long time to do the job. That idea was further supported by two counterintuitive facts:

  1. Optimizing deserialization code to work faster only made things worse
  2. Inserting Thread.Sleep calls at strategic places made the load go faster

(Screenshot: example of a slow load, with 7 generation 2 collections and a huge % of time in GC. Bad.)

(Screenshot: example of a fast load with sleep periods in the code to give the GC some time. In this case we have 19 generation 2 collections and also more than double the number of generation 0 and generation 1 collections. Good.)

So, my question is: how can I prevent this GC starvation? Adding Thread.Sleep looks silly, and it is very difficult to guess the right number of milliseconds in the right places. My other idea would be to use GC.Collect, but that poses the same difficulty of how many calls to make and where to put them. Any other ideas?
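To make the current workaround concrete, here is roughly what the sleep-based throttling looks like. This is only a sketch: the object count and sleep duration are placeholder values found by trial and error, and DeserializeLine stands in for our real deserialization code.

```csharp
using System.Threading;

class LoaderSketch
{
    // Both constants were guessed by trial and error, which is
    // exactly the problem with this approach.
    const int ObjectsPerSleep = 50000;
    const int SleepMs = 10;

    static void LoadObjects(string[] csvLines)
    {
        int loaded = 0;
        foreach (string line in csvLines)
        {
            DeserializeLine(line);               // custom CSV -> object code
            if (++loaded % ObjectsPerSleep == 0)
                Thread.Sleep(SleepMs);           // give the GC room to run
        }
    }

    static void DeserializeLine(string line)
    {
        // placeholder for the real parser
    }
}
```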

  • I wonder how/why Thread.Sleep made things faster? How much faster? – M.kazem Akhgary Nov 03 '15 at 16:26
  • I have seen this type of behavior in situations where a lot of duplicate string variables are created and then dereferenced as part of serialization. What kind of serialization are you using to load the project file? – Chris Shain Nov 03 '15 at 16:35
  • All the loading is done on background threads, so I assume sleep allows the GC thread(s) to kick in. From the images above you can see that we go from 4 minutes to 25 seconds once we introduce sleep. Slowing down to go faster :) – Slobodan Savkovic Nov 03 '15 at 16:35
  • We store data in CSV files and yes, the parser we use creates enormous amount of strings. In the future I have to look for better parser. I will post separate question on parsing... – Slobodan Savkovic Nov 03 '15 at 16:39
  • Might want to take a look at using WeakReference when loading up your collections. – dbugger Nov 03 '15 at 16:40
  • It is sort of obvious that % time in GC *has* to go down when you are sleeping instead of executing code. Beware of .NET 4.6, it has a [very nasty GC bug](http://stackoverflow.com/a/31774612/17034) that behaves exactly like this. – Hans Passant Nov 03 '15 at 16:40
  • GC is triggered by allocations. If you sleep, no GC will be triggered during that time. I find the evidence for the GC starvation theory very weak. – usr Nov 03 '15 at 16:42
  • "huge % of time in GC is below" Looks like almost 0% GC time to me because the % time counter is only updated in case of a GC. The long horizontal red line actually corresponds to no GC activity at all. – usr Nov 03 '15 at 16:44
  • @HansPassant Look in the comments to the answer from that question; it looks like that bug is fixed with a hotfix: http://stackoverflow.com/a/31999381/4450618 – bkribbs Nov 03 '15 at 16:45
  • Maybe you should profile the app to see where the time is actually spent; profilers can show you GC time. Also turn on server GC, which can help a lot. – usr Nov 03 '15 at 16:47
  • GC will kick in as soon as the budget for gen 0 is exhausted, so a high allocation rate will trigger many GCs. It won't starve GC. I think the problem is something else. – Brian Rasmussen Nov 03 '15 at 16:54

2 Answers


Based on the comments, I'd guess that you are doing a ton of String.Substring() operations as part of CSV parsing. Each of these creates a new string instance, which I'd bet you then throw away after further parsing it into an integer or date or whatever you need. You almost certainly need to start thinking about using a different persistence mechanism (CSV has a lot of shortcomings that you are undoubtedly aware of), but in the meantime you are going to want to look into versions of the parsers that do not allocate substrings. If you dig into the code for Int32.TryParse, you'll find that it does some character iteration to avoid allocating more strings. I'd bet that you could spend an hour writing a version that takes start and end parameters; then you can pass it the whole line with offsets and avoid the substring call to get the individual field values. Doing that will save you millions of allocations.
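As a sketch of the idea (not production code: no overflow checking, ASCII digits only, and the class and method names are made up for illustration), the caller finds the comma positions and passes [start, end) offsets into the full line:

```csharp
static class CsvFieldParser
{
    // Parses an Int32 directly from line[start..end) without allocating
    // a substring. Sketch only: no overflow or whitespace handling.
    public static bool TryParseInt32(string line, int start, int end, out int value)
    {
        value = 0;
        int i = start;
        bool negative = i < end && line[i] == '-';
        if (negative) i++;
        if (i >= end) return false;          // empty field or lone '-'

        for (; i < end; i++)
        {
            char c = line[i];
            if (c < '0' || c > '9') return false;
            value = value * 10 + (c - '0');
        }
        if (negative) value = -value;
        return true;
    }
}
```

For the line "12,-345,67", calling TryParseInt32 with start 3 and end 7 yields -345 without creating a single intermediate string; the same pattern extends to dates and floats.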

Chris Shain
  • I need to ask a separate question on the topic of efficient CSV parsing. The problem there is that .NET steers you toward using strings, as you conveniently have StreamReader.ReadLine, String.Split and Single.TryParse all working with strings. Most of the available CSV parsers do exactly that... Wish there was a mutable string that I could re-use over and over again. – Slobodan Savkovic Nov 03 '15 at 20:23

So, it appears that this is a .NET bug rather than GC starvation. The workarounds and answers described in the question Garbage Collection and Parallel.ForEach Issue After VS2015 Upgrade apply perfectly. I got the best results by switching to GC server mode.
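In case it helps anyone else, server GC is enabled through the standard runtime element in App.config (this is the regular .NET configuration setting, nothing specific to the hotfix):

```xml
<!-- App.config -->
<configuration>
  <runtime>
    <!-- Server GC uses one heap per core and much larger gen 0 budgets,
         which suits allocation-heavy parallel loads like ours. -->
    <gcServer enabled="true"/>
  </runtime>
</configuration>
```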

Note, however, that I am experiencing this issue in .NET 4.5.2. I will add a hotfix link if there is one.
