I'm currently creating an app that iterates through a number of URLs, pulls down each page's source code, and then extracts specific data using reference points such as element ids.

The source code is loaded into a String object and then processed by finding the reference point with IndexOf and extracting the data with Substring.
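A minimal sketch of that kind of extraction, for illustration only. The class and method names (`Extractor`, `ExtractHrefBeforeId`) are hypothetical, and the sketch assumes the `href` attribute appears before the `id` attribute in the same tag, as in the sample further down:

```csharp
using System;

static class Extractor
{
    // Returns the href value of the tag carrying the given id, or null if not found.
    // Assumption: href precedes id inside the same tag.
    public static string ExtractHrefBeforeId(string source, string id)
    {
        // Locate the reference point.
        int idPos = source.IndexOf("id=\"" + id + "\"", StringComparison.Ordinal);
        if (idPos < 0) return null;

        // Search backwards from the reference point for the nearest href attribute.
        const string hrefMarker = "href=\"";
        int hrefPos = source.LastIndexOf(hrefMarker, idPos, StringComparison.Ordinal);
        if (hrefPos < 0) return null;

        // Cut out everything up to the closing quote of the href value.
        int start = hrefPos + hrefMarker.Length;
        int end = source.IndexOf('"', start);
        if (end < 0) return null;
        return source.Substring(start, end - start);
    }
}
```

Substring copies the extracted characters into a new, small string, so the result does not keep the full page source alive.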

The problem is that the String object is reported as generation 2 by the garbage collector, which means it stays in memory for a while before being collected. As a result, the memory usage of the app keeps growing as it accesses more and more URLs.

I have run the app and processed 25 URLs. The memory usage jumped to 300 MB and, after a while (I assume once garbage collection had fired), fell back down to 1 MB.

Since I only need the source for a short time, just long enough to extract the data, is there a more optimised way of doing this?

Note: I can't read the source in chunks, because a chunk boundary could fall part way through a reference point.

For example,

...<a href="http://www.some-website.com/" id="link-I-need">Hyperlink</a>...

could be separated as such

...<a href="http://www.some-website.com/" id="link-] (End of first chunk) - (Start of second chunk) [I-need">Hyperlink</a>...
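One possible way around the split shown above, sketched under the assumption that the reference point has a known, bounded length: keep the last `marker.Length - 1` characters of each chunk and prepend them to the next one, so a marker that straddles a boundary is still seen whole. The names `ChunkedSearch`, `IndexOfInStream` and `chunkSize` are hypothetical, and the sketch only locates the marker; extracting the data around it would need a correspondingly larger window.

```csharp
using System;
using System.IO;

static class ChunkedSearch
{
    // Returns the zero-based character offset of the first occurrence of marker
    // in the reader's content, or -1 if it is absent.
    public static long IndexOfInStream(TextReader reader, string marker, int chunkSize = 4096)
    {
        if (string.IsNullOrEmpty(marker))
            throw new ArgumentException("marker must be non-empty", "marker");

        int overlap = marker.Length - 1;
        char[] window = new char[overlap + chunkSize];
        int carried = 0;       // characters kept from the previous chunk
        long windowStart = 0;  // stream offset of window[0]

        int read;
        while ((read = reader.Read(window, carried, chunkSize)) > 0)
        {
            int length = carried + read;

            // The temporary string is only as large as the window, so it stays small.
            int pos = new string(window, 0, length).IndexOf(marker, StringComparison.Ordinal);
            if (pos >= 0) return windowStart + pos;

            // Carry the tail over so a marker split across two chunks is still found.
            int keep = Math.Min(overlap, length);
            Array.Copy(window, length - keep, window, 0, keep);
            windowStart += length - keep;
            carried = keep;
        }
        return -1;
    }
}
```

With a small `chunkSize`, no buffer in this sketch comes anywhere near the size of the full page source.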
Gary Connell
  • What I mean is that the program, on start up, is running at about 1 MB, and when I begin processing it jumps to 300 MB. When processing is complete it is still sitting at 300 MB, then after a while it falls back down to 1 MB. – Gary Connell Oct 10 '12 at 12:29
  • Yes, that's how a garbage collector works. It collects when you *use* memory, not when you *stop* using memory. – Hans Passant Oct 10 '12 at 12:33
  • Wouldn't I be right in thinking that if an object's generation is 2, it is classed as a 'long life object', which means garbage collection occurs differently? But this string object isn't a long life object; I only need to use it for a short amount of time. I believe the reason it is a generation 2 object is that it is larger than roughly 85,000 bytes, which makes it a Large Object, handled by the Large Object Heap and automatically treated as generation 2. – Gary Connell Oct 10 '12 at 12:58
  • Can you not just call GC.Collect() occasionally? So download the URL/file, do the relevant processing, make sure the strings/other data are out of scope and/or set to null, and call GC.Collect() before moving on to the next download. That said, 300 MB isn't really that much. I know we don't want to make huge bloated programs, but if the memory is sitting there otherwise idle, and you're not into swap space or throttling other applications, then I'd just let the GC do its thing. – Gareth Wilson Oct 10 '12 at 14:04
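The suggestion in the last comment could look roughly like the sketch below. `Batch`, `ProcessAll`, `ProcessOne`, `download` and `extract` are hypothetical names; the download and extraction steps are passed in as delegates so the sketch stays self-contained. Forcing a collection after every page is generally not recommended and is shown only to illustrate the idea.

```csharp
using System;
using System.Collections.Generic;

static class Batch
{
    // The large source string lives only inside this method, so it is
    // unreachable as soon as the method returns.
    static string ProcessOne(string url, Func<string, string> download, Func<string, string> extract)
    {
        string source = download(url);
        return extract(source);
    }

    public static List<string> ProcessAll(IEnumerable<string> urls,
                                          Func<string, string> download,
                                          Func<string, string> extract)
    {
        var results = new List<string>();
        foreach (string url in urls)
        {
            results.Add(ProcessOne(url, download, extract));

            // Request a full collection before the next download. This covers
            // all generations, including the Large Object Heap.
            GC.Collect();
        }
        return results;
    }
}
```

A call might look like `Batch.ProcessAll(urls, u => new System.Net.WebClient().DownloadString(u), s => Extract(s))`, where `Extract` stands in for whatever IndexOf/Substring logic the app already has.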

3 Answers

0

If you write your code in such a way that no string stays in scope longer than it has to, the CLR will collect it whenever it deems the time is right. So when your program needs memory, the CLR will make that memory available.

How the CLR works and when it cleans up is of no concern to user code unless you are doing time-sensitive operations.

Roy Dictus
  • The problem isn't that clean up isn't occurring, it's that it isn't occurring quickly enough. I understand that if the program needs memory it will run GC. The problem I have is that the program could be using 500 MB, 1 GB or 1.5 GB of memory before it realises it _needs_ to run GC, meaning the program is using up all the system resources (and negatively affecting the rest of the system) before GC kicks in. – Gary Connell Oct 10 '12 at 12:44
  • Aha, so the real question is: "can I constrain the resources assigned to a .NET process?" Is that right? – Roy Dictus Oct 10 '12 at 12:59
  • Well if that would stop the program hogging all the system resources before GC kicks in, then yes :) – Gary Connell Oct 10 '12 at 13:03
0

Have you considered a different methodology such as an HTML parser? An HTML parser could be more efficient than what you are attempting. The following article may be helpful: "What is the best way to parse HTML in C#?"

Community
Mike Cowan
-1

If you're not already, use a StringBuilder object and append to the builder instead of concatenating strings.

At the end of each processing iteration you can clear the StringBuilder and release the memory.

Rob Hardy
  • I am not currently performing concatenation. I have tried various methods such as WebClient.DownloadString, WebBrowser.DocumentText etc., which only output a String object. But if I were to load this string into a StringBuilder and then clear it after processing, could this solve the problem? – Gary Connell Oct 10 '12 at 12:39
  • Perhaps if you created the StringBuilder **on** the large string you already have, but I can't confirm that. – Rob Hardy Oct 10 '12 at 12:41