4

I'm writing an application in C# that cycles through the articles of a local database copy of Wikipedia. I use a bunch of regexes to find the right information in these articles, launch a thread to fetch an image for each article, save the information and go to the next article.

I need to use a list of proxies to download these images so I don't get banned by Google. As proxies can be slow, I use threads to download in parallel.

If I don't use threads, the application works fine, but it takes a long time to get all the information.

If I use threads, the application works until it reaches around 500 threads, and then I get an OutOfMemory exception.

The thing is, it only uses ~300 MB of RAM, so it is nowhere near exhausting either the total memory available (8 GB) or the memory a single 32-bit application can allocate.

Is there a limit on the number of threads per application?

EDIT:

Here is the code that downloads the poster (started via getPosterAsc()).

    // Holds the result of the most recent download; written by the worker
    // thread and read by getPoster after Join() returns.
    string ddlValue = "";

    private void tryDownload(object obj)
    {
        WebClient webClientProxy = new WebClient();
        object[] args = (object[])obj;
        Tuple<WebProxy, int> proxy = (Tuple<WebProxy, int>)args[0];
        if (proxy != null)
            webClientProxy.Proxy = proxy.Item1;
        try
        {
            // Download the search results page as a single string.
            ddlValue = webClientProxy.DownloadString((string)args[1]);
        }
        catch (Exception ex)
        {
            ddlValue = "";
            Console.WriteLine("tryDownload: " + ex.Message);
        }

        webClientProxy.Dispose();
    }

    // Fetches a poster image via Google Images, optionally through a proxy.
    // options[0] = save the poster, options[1] = banner mode.
    public void getPoster(object options = null)
    {
        if (options == null)
            options = new object[2] { toSave, false };
        if (!AppVar.debugMode && AppVar.getImages && this.getImage)
        {
            if (this.original_name != "" && !this.ambName && this.suitable)
            {
                Log.CountImgInc();

                MatchCollection MatchList;
                string basic_options = "";
                string value = "";
                WebClient webClient = new WebClient();
                Regex reg;
                bool found = false;

                if (original_name.Split(' ').Length > 1) image_options = "";

                if (!found)
                {
                    bool success = false;
                    int countTry = 0;
                    while (!success)
                    {
                        // After 5 failed attempts, fall back to a direct connection (no proxy).
                        Tuple<WebProxy, int> proxy = null;
                        if (countTry != 5)
                            proxy = Proxy.getProxy();

                        try
                        {
                            // Run the download on a worker thread so a timeout can be enforced.
                            Thread t = new Thread(tryDownload);
                            if (!(bool)((object[])options)[1])
                                t.Start(new object[] { proxy, @"http://www.google.com/search?as_st=y&tbm=isch&as_q=" + image_options + "+" + basic_options + "+" + image_options_before + "%22" + simplify(original_name) + "%22+" + " OR %22" + original_name + "%22+" + image_options_after + this.image_format });
                            else
                                t.Start(new object[] { proxy, @"http://www.google.com/search?as_st=y&tbm=isch&as_q=" + image_options + "+" + basic_options + "+" + image_options_before + "%22" + simplify(original_name) + "%22+" + " OR %22" + original_name + "%22+" + image_options_after + "&biw=1218&bih=927&tbs=isz:ex,iszw:758,iszh:140,ift:jpg&tbm=isch&source=lnt&sa=X&ei=kuG7T6qaOYKr-gafsOHNCg&ved=0CIwBEKcFKAE" });
                            if (!t.Join(40000))
                            {
                                // Timed out: blacklist the proxy (if any) and try again.
                                if (proxy != null)
                                    Proxy.badProxy(proxy.Item1.Address.Host, proxy.Item1.Address.Port);
                                continue;
                            }
                            else
                            {
                                value = ddlValue;
                                if (value != "")
                                    success = true;
                                else if (proxy != null)
                                    Proxy.badProxy(proxy.Item1.Address.Host, proxy.Item1.Address.Port);
                            }
                        }
                        catch (Exception ex)
                        {
                            if (proxy != null)
                                Proxy.badProxy(proxy.Item1.Address.Host, proxy.Item1.Address.Port);
                        }
                        countTry++;
                    }

                    // Pull the candidate image URLs out of the results page.
                    reg = new Regex(@"imgurl\=(.*?)&amp;imgrefurl", RegexOptions.IgnoreCase);
                    MatchList = reg.Matches(value);
                    if (MatchList.Count > 0)
                    {
                        bool foundgg = false;
                        int j = 0;
                        while (!foundgg && MatchList.Count > j)
                        {
                            if (MatchList[j].Groups[1].Value.EndsWith("jpg"))
                            {
                                try
                                {
                                    string guid = Guid.NewGuid().ToString();
                                    webClient.DownloadFile(MatchList[j].Groups[1].Value, @"c:\temp\" + guid + ".jpg");

                                    FileInfo fi = new FileInfo(@"c:\temp\" + guid + ".jpg");
                                    this.image_size = fi.Length;

                                    using (Image img = Image.FromFile(@"c:\temp\" + guid + ".jpg"))
                                    {
                                        int minHeight = this.cov_min_height;
                                        if ((bool)((object[])options)[1])
                                            minHeight = 100;

                                        // Accept only genuine JPEGs with sufficient resolution and dimensions, and a file under ~250 KB.
                                        if (img.RawFormat.Equals(System.Drawing.Imaging.ImageFormat.Jpeg) && img.HorizontalResolution > 70 && img.Size.Height > minHeight && img.Size.Width > this.cov_min_width && this.image_size < 250000)
                                        {
                                            foundgg = true;
                                            image_name = guid;
                                            image_height = img.Height;
                                            image_width = img.Width;
                                            // Dispose early to release the file lock before the poster is saved.
                                            img.Dispose();
                                            if ((bool)((object[])options)[0])
                                            {
                                                Mediatly.savePoster(this, (bool)((object[])options)[1]);
                                            }
                                        }
                                        else
                                        {
                                            // Dispose first to release the file lock, then delete the rejected temp file.
                                            img.Dispose();
                                            File.Delete(@"c:\temp\" + guid + ".jpg");
                                        }
                                    }
                                }
                                catch (Exception ex)
                                {
                                    // Ignore download/decoding errors and move on to the next candidate URL.
                                }
                            }

                            j++;
                        }
                    }
                }

                webClient.Dispose();
                Log.CountImgDec();
            }
        }
    }

    // Queues getPoster on the thread pool so the caller is not blocked.
    public void getPosterAsc(bool save = false, bool banner = false)
    {
        ThreadPool.QueueUserWorkItem(new WaitCallback(getPoster), new object[2] { save, banner });
    }
Irwin
Sébastien
  • can't tell without some code shown – Raptor May 23 '12 at 08:34
  • Don't launch a separate thread for each item (image?) you fetch. That will cause at least 500 MB to be allocated (since each thread has at least a 1 MB stack allocated, plus other resources). You should rather use the [ThreadPool](http://msdn.microsoft.com/en-us/library/system.threading.threadpool.aspx) or [Tasks](http://msdn.microsoft.com/en-us/library/system.threading.tasks.task.aspx) and let them drain the queue. Note that this is an (over)simplification. Check [this](http://stackoverflow.com/q/145304/21567) SO answer for more details and explanation. – Christian.K May 23 '12 at 08:38
  • @ShivanRaptor: Can't show the code, as the whole thing is over 40,000 lines. – Sébastien May 23 '12 at 08:58
  • @Christian.K: What is strange is that the memory usage in Task Manager shows no more than 300 MB. The problem with the ThreadPool is that every object fetches a large amount of data from the Wikipedia article, and one object can contain other objects (e.g. movies and actors); the ThreadPool will keep all these objects alive until the thread has been launched and terminated. – Sébastien May 23 '12 at 08:58
  • Then you merely need to impose a limit on concurrent threads based upon system information. – MoonKnight May 23 '12 at 09:01
  • @Killercam: that's the problem, I need the maximum possible number of threads to optimize the execution time, but this maximum depends on whether I compile for 64-bit or 32-bit, whether the OS was just restarted, and the amount of available memory, even though I don't use all of that available memory... – Sébastien May 23 '12 at 09:08
  • @Sébastien "I need to have the maximum number of threads possible" I understand, but the maximum possible is not N (where N is long.Max or some other stupendously big integer); it is limited by the local system. All I am saying is that spawning N threads "to save time" is _not_ the right way to go about this. Are you pooling your threads? – MoonKnight May 23 '12 at 09:51
  • You are probably exhausting the kernel memory pool with this code. Lots of I/O buffers that don't get read in time. There's just no scenario where using 500 threads makes sense in this context. – Hans Passant May 23 '12 at 10:01

4 Answers

3

I would make sure that you are using the thread pool to 'manage' your threads. As someone has said, each thread consumes around 1 MB of memory for its stack, and depending on system hardware this could be causing your problem.

One way to address this issue is to use the thread pool. It cuts the overhead incurred by spawning all your threads, by sharing and recycling threads where possible. You still get low-level threading (with many threads active), but the performance penalty of doing so is limited.

The thread pool also keeps a limit on the number of worker threads (note: these will all be background threads) it runs simultaneously. Too many active threads impose a large administrative overhead and can render the CPU cache ineffective. Once the thread pool's limit is reached, additional jobs are queued and executed when another worker thread becomes free. This, I feel, is a much more effective, safer and more resource-efficient way of doing what you require.

Depending on your current code there are a number of ways to enter the thread pool:

  1. BackgroundWorker.
  2. ThreadPool.QueueUserWorkItem.
  3. Asynchronous delegates.
  4. PLINQ.

Personally I would use the TPL, as it is awesome! I hope this helps.
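
As a rough illustration, here is a minimal sketch (not your actual code; DownloadOne and the URL list are hypothetical stand-ins for your per-image logic) of letting the TPL drain the queue with a capped degree of parallelism:

    using System;
    using System.Collections.Generic;
    using System.Net;
    using System.Threading.Tasks;

    class PosterFetcher
    {
        // Hypothetical stand-in for the per-image download logic.
        static void DownloadOne(string url)
        {
            using (var client = new WebClient())
            {
                // Assign a proxy here if needed: client.Proxy = ...;
                string page = client.DownloadString(url);
                // ... parse the page and save the image ...
            }
        }

        static void DownloadAll(IEnumerable<string> urls)
        {
            // Cap simultaneous downloads; the TPL reuses pool threads
            // instead of creating ~500 dedicated threads with 1 MB stacks.
            var options = new ParallelOptions { MaxDegreeOfParallelism = 10 };
            Parallel.ForEach(urls, options, DownloadOne);
        }
    }

If urls is a lazy sequence, items are only materialized as workers pick them up, which also limits how many heavy objects are alive at once.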

MoonKnight
  • In fact, I already use BackgroundWorkers to keep my WinForms UI alive while I do the work. Actually I use 10 BWs (one for each Wikipedia language, as I fetch information for many languages). These BWs cycle through the articles and fetch information (i.e. if it's a movie, it also gets actors, crew and companies). After each article, it downloads the movie's poster asynchronously. If I use the ThreadPool it quickly takes much more memory, as all objects (with their actors, crew, ...) will be kept alive in the pool's queue. – Sébastien May 23 '12 at 11:00
  • Show some example code; how can we help you without it? It doesn't have to be the 40,000 lines; write a section indicative of your current scenario. – MoonKnight May 23 '12 at 11:33
  • OK, I just edited my post with the main methods that download the images ;) – Sébastien May 23 '12 at 11:46
1

Using perfmon, check what is actually using the memory; in particular, pay close attention to the 'Modified Page List Bytes' value. This can be particularly troublesome in multithreaded applications where references to files are held for some length of time; the usual (temporary) resolution for high utilisation of this value is to increase the available virtual memory.

Also, if you run highly threaded applications on Windows Server 2008, you will need to apply Microsoft's Dynamic Cache Service (DynCache) to prevent the system file cache from effectively eating your available memory.

Both of the issues above can be directly related back to .NET multithreaded applications processing large amounts of data. Unfortunately, the memory doesn't show up as being used by your application, and as a result these issues can be hard to track down (as I found out over the course of a painful few days).
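
For what it's worth, you can also sample that counter from code rather than the perfmon UI. A minimal sketch (assuming Windows Vista/Server 2008 or later, where this counter exists):

    using System;
    using System.Diagnostics;
    using System.Threading;

    class MemoryWatch
    {
        static void Main()
        {
            // "Modified Page List Bytes" is a system-wide counter in the
            // "Memory" category (no instance name required).
            using (var counter = new PerformanceCounter("Memory", "Modified Page List Bytes"))
            {
                for (int i = 0; i < 10; i++)
                {
                    Console.WriteLine("Modified Page List: {0:N0} bytes", counter.NextValue());
                    Thread.Sleep(1000);
                }
            }
        }
    }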

Johnv2020
0

When you use a 32-bit executable, you can actually allocate only 2 GB by default, not 8 GB (see here for more information: http://blogs.msdn.com/b/tom/archive/2008/04/10/chat-question-memory-limits-for-32-bit-and-64-bit-processes.aspx).

Try limiting your worker threads so you don't use that many, and make sure you don't have a memory leak in the code the threads execute.

Wrap your thread's work in a try...catch (if the OutOfMemoryException occurs in the thread's code), because it might be related to the images you download.
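
A minimal sketch of both suggestions together (not your actual code; the cap of 20 and the temp path are arbitrary assumptions): bound the number of in-flight downloads with a semaphore and catch the exception inside the worker:

    using System;
    using System.Net;
    using System.Threading;

    class BoundedWorkers
    {
        // Allow at most 20 downloads in flight at once (arbitrary cap).
        static readonly Semaphore slots = new Semaphore(20, 20);

        static void Download(object state)
        {
            try
            {
                using (var client = new WebClient())
                    client.DownloadFile((string)state, @"c:\temp\" + Guid.NewGuid() + ".jpg");
            }
            catch (OutOfMemoryException ex)
            {
                // Log and skip this item instead of letting the thread die.
                Console.WriteLine("OOM while downloading: " + ex.Message);
            }
            finally
            {
                slots.Release();
            }
        }

        static void Queue(string url)
        {
            slots.WaitOne(); // blocks while 20 downloads are already running
            ThreadPool.QueueUserWorkItem(Download, url);
        }
    }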

eyossi
  • I know about 32-bit limitations. As I said, the application doesn't use more than ~300 MB. What is strange is that I already tried limiting the number of threads, and then it works (no problem with images or memory leaks). That's why I assume the sole problem is the number of threads. – Sébastien May 23 '12 at 09:03
0

I recently ran into a problem in one of my applications that looked very similar to this. It had to do with the amount of data being stored and used in a single string object. If I had to guess, your OutOfMemory exception is coming from the initial assignment:

    ddlValue = webClientProxy.DownloadString((string)((object[])obj)[1]);

If you can rewrite it, find a way to access the web response as a stream instead of reading the entire response into a string; you can then parse the response line by line using a StreamReader.

Yes, I know this sounds complicated, but it matches the solution I ended up using in my own code. I was dealing with pieces of data that were too large to store in a single string and had to access them directly from the stream instead.
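
A minimal sketch of what I mean (hypothetical names, not your actual code; adapt the per-line parsing to your imgurl regex):

    using System;
    using System.IO;
    using System.Net;

    class StreamingDownload
    {
        static void ScanResponse(string url, WebProxy proxy)
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            if (proxy != null)
                request.Proxy = proxy;

            // Read the response line by line instead of buffering the
            // whole page in a single large string.
            using (var response = request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    // Run the regex on each line here, so only one line
                    // of the response is held in memory at a time.
                }
            }
        }
    }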

Nevyn