
I'm currently learning C# and I've been working on an XML parser for the last two days. It's actually working fine; my issue is the amount of time it takes to parse more than 10k pages. This is my code:

    public static void startParse(int id_min, int id_max, int numberofthreads)
    {
        int start;
        int end;
        int part;
        int threadnbrs;

        threadnbrs = numberofthreads;
        List<Thread> workerThreads;
        List<string> results;

        part = (id_max - id_min) / threadnbrs;
        start = id_min;
        end = 0;
        workerThreads = new List<Thread>();
        results = new List<string>();

        for (int i = 0; i < threadnbrs; i++)
        {
            if (i != 0)
                start = end + 1;
            end = start + (part);
            if (i == (threadnbrs - 1))
                end = id_max;

            int _i = i;
            int _start = start;
            int _end = end;

            Thread t = new Thread(() =>
            {
                Console.WriteLine("i = " + _i);
                Console.WriteLine("start =" + _start);
                Console.WriteLine("end =" + _end + "\r\n");
                string parse = new ParseWH().parse(_start, _end);
                lock (results)
                {
                    results.Add(parse);
                }
            });
            workerThreads.Add(t);
            t.Start();
        }
        foreach (Thread thread in workerThreads)
            thread.Join();

        File.WriteAllText(".\\result.txt", String.Join("", results));
        Console.Beep();
    }

What I'm actually doing is splitting the range of elements that need to be parsed across different threads, so each thread handles X elements.

Each 100 elements takes approximately 20 seconds, yet it took me 17 minutes to parse 10,000 elements.

What I need is every thread working simultaneously, each on 100 of those 10,000 elements, so it can all be done in 20 seconds. Is there a solution for that?

Parse code:

    public string parse(int id_min, int id_max)
    {
        XmlDocument xml;
        WebClient user;
        XmlElement element;
        XmlNodeList nodes;
        string result;
        string address;
        int i;

        //Console.WriteLine(id_min);
        //Console.WriteLine(id_max);
        i = id_min;
        result = "";
        xml = new XmlDocument();
        while (i <= id_max)
        {
            user = new WebClient();
            // user.Headers.Add("User-Agent", "Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30");
            user.Encoding = UTF8Encoding.UTF8;
            address = "http://fr.wowhead.com/item=" + i + "?xml";
            xml.LoadXml(user.DownloadString(new Uri(address)));
            element = xml.DocumentElement;
            nodes = element.SelectNodes("/wowhead");
            if (xml.SelectSingleNode("/wowhead/error") != null)
            {
                Console.WriteLine("error " + i);
                i++;
                continue;
            }
            result += "INSERT INTO item_wh (entry, class, subclass, displayId, quality, name, level) VALUES (";
            foreach (XmlNode node in nodes)
            {
                // entry
                result += node["item"].Attributes["id"].InnerText;
                result += ", ";
                // class
                result += node["item"]["class"].Attributes["id"].InnerText;
                result += ", ";
                // subclass
                result += node["item"]["subclass"].Attributes["id"].InnerText;
                result += ", ";
                // displayId
                result += node["item"]["icon"].Attributes["displayId"].InnerText;
                result += ", ";
                // quality
                result += node["item"]["quality"].Attributes["id"].InnerText;
                result += ", \"";
                // name
                result += node["item"]["name"].InnerText;
                result += "\", ";
                // level
                result += node["item"]["level"].InnerText;
                result += ");";
                // newline
                result += "\r\n";
            }
            i++;
        }
        return (result);
    }
Phil
    So it takes 20 seconds to parse 100 elements... how do you expect to get 1000x the throughput, parsing 1000 times as many elements in the same amount of time? Threading doesn't just magically give you free compute power or network bandwidth (we've no idea what's taking the time for the 100 elements). – Jon Skeet Jan 29 '17 at 19:21
  • @JonSkeet - but adding threads will allow them to access multiple CPU cores, whereas doing it linearly will only use one core at the maximum. On a multi-core machine, threading will corral more cores into the procedure. Agree completely that the issue here is the time taken to parse 100 items, which is massively excessive. – PhillipH Jan 29 '17 at 19:30
    @PhillipH On a typical modern computer with, say, 8 cores, you would at an absolute maximum get a 7-8x speedup, not 1000x. As a side note, other things quickly become the bottleneck (like network or disk IO). You'd be better off profiling your code to find out _why_ your parsing takes that long and fix that instead. – Luke Briggs Jan 29 '17 at 19:34
  • @LukeBriggs - agreed. However, I thought Jon's comment to the OP was incorrect; multi-threading does give you "magic free computing power". The OP definitely needs to profile before optimising, however. – PhillipH Jan 29 '17 at 19:39
    @PhillipH: No, it doesn't give "magic free computing power" - it lets you make better use of your existing power. A factor of 7 or 8 would be reasonable - it's the factor of 1000 that is completely unreasonable, and is basically an expectation of magic. – Jon Skeet Jan 29 '17 at 19:43
  • So actually I understood correctly how it works: T1 parses & writes element 1 while T2 waits for T1 to finish, and only during that time is it parsing element 2, etc. What I need is T1 parsing & writing element 1 while, at the same time, T2 parses and writes element 2, as if the application were open 2 times or more. – Phil Jan 29 '17 at 20:01
  • @PhilippeMakzoume - Can you paste an example of the parsing code? Your question (in my understanding) is really about the speed of the parsing and yet you have not posted any of/examples of the parsing itself. – pstrjds Jan 29 '17 at 20:30
  • Whoops, thought I already did that! Sorry, just added it =) – Phil Jan 29 '17 at 20:36
    @PhilippeMakzoume Firstly, is that website ok with it being hit hard like this? Whilst it is possible to send 10,000 requests to a webserver within 20 seconds, most webmasters won't like you for it. If it's your server, I would highly recommend you add an endpoint which responds with all the items in a single request. If you can't use a page like that and the server is fine with being hammered with requests then use asynchronous HTTP instead. – Luke Briggs Jan 29 '17 at 20:46
  • @PhilippeMakzoume - You should consider using a [StringBuilder](https://msdn.microsoft.com/en-us/library/system.text.stringbuilder(v=vs.110).aspx) in place of the `result` in your parsing code. You are creating a lot of temporary `string` objects in that parsing code (in C# strings are immutable, and repeated `+=` results in new strings being allocated and the others abandoned). – pstrjds Jan 29 '17 at 20:46
  • Everything is fine with collecting and writing the data; the only issue left is that the threads are not all writing together from different web pages! I will take the StringBuilder into consideration, thanks =) – Phil Jan 29 '17 at 20:57
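
For reference, a minimal sketch of what the two suggestions from the comments above (asynchronous HTTP and a `StringBuilder`) could look like combined. The `HttpClient` usage, the `FetchAllAsync` name and the concurrency cap of 20 are illustrative assumptions, not code from the question:

    using System.Linq;
    using System.Net.Http;
    using System.Text;
    using System.Threading;
    using System.Threading.Tasks;

    public static class AsyncFetcher
    {
        private static readonly HttpClient client = new HttpClient();

        // Downloads all pages concurrently (capped by a semaphore), then
        // appends them with a StringBuilder instead of repeated +=.
        public static async Task<string> FetchAllAsync(int idMin, int idMax)
        {
            var throttle = new SemaphoreSlim(20); // arbitrary example cap
            var tasks = Enumerable.Range(idMin, idMax - idMin + 1).Select(async id =>
            {
                await throttle.WaitAsync();
                try
                {
                    return await client.GetStringAsync(
                        "http://fr.wowhead.com/item=" + id + "?xml");
                }
                finally
                {
                    throttle.Release();
                }
            });

            string[] pages = await Task.WhenAll(tasks);

            var sb = new StringBuilder(); // avoids reallocating a string per append
            foreach (string page in pages)
                sb.Append(page);          // real code would parse each page here
            return sb.ToString();
        }
    }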

3 Answers


The best solution for CPU-bound work (such as parsing) is to launch as many threads as the number of cores in your machine; with fewer than that you are not taking advantage of all of your cores, and with more than that excessive context-switching might kick in and hinder performance.

So essentially, `threadnbrs` should be set to `Environment.ProcessorCount`.

Also, consider using the `Parallel` class instead of creating threads yourself:

    Parallel.ForEach(thingsToParse, (somethingToParse) =>
    {
        var parsed = Parse(somethingToParse);
        results.Add(parsed);
    });

You must agree that it looks much cleaner and is much easier to maintain. You'll also be better off using `ConcurrentBag` instead of a regular `List` + `lock`, as `ConcurrentBag` is built for concurrent loads and could give you better performance; a sketch follows below.
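
Put together, a minimal sketch might look like this (`Parse` here is a hypothetical stand-in for the question's `new ParseWH().parse` call, reduced to one ID per invocation):

    using System.Collections.Concurrent;
    using System.Linq;
    using System.Threading.Tasks;

    public static class ParallelParser
    {
        // Stand-in for the question's ParseWH().parse; one ID per call.
        private static string Parse(int id) => "parsed " + id + "\r\n";

        public static string ParseRange(int idMin, int idMax)
        {
            var results = new ConcurrentBag<string>(); // thread-safe, no lock needed

            Parallel.ForEach(Enumerable.Range(idMin, idMax - idMin + 1), id =>
            {
                results.Add(Parse(id)); // safe to call from many threads at once
            });

            // Note: ConcurrentBag does not preserve order.
            return string.Join("", results);
        }
    }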

areller

Finally! Got it working by launching multiple processes of my application simultaneously.

Which means that if I have 10k elements, I run 10 processes of 1,000 elements each. Increase the number of processes to decrease the number of elements per process, and it goes faster and faster! (I'm currently on a very fast Internet connection, with a Samsung M.2 960 as storage as well as a 6-core Core i7 Skylake.)
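
As a rough illustration of that multi-process approach (the `Parser.exe` name, its `min max` argument convention and the chunk size of 1,000 are all made up for this sketch):

    using System.Collections.Generic;
    using System.Diagnostics;

    public static class Launcher
    {
        public static void Main()
        {
            // One process per 1,000-ID chunk; each instance parses its own slice.
            var processes = new List<Process>();
            for (int start = 1; start <= 10000; start += 1000)
            {
                processes.Add(Process.Start("Parser.exe", start + " " + (start + 999)));
            }

            // Wait for every chunk to finish before combining the output.
            foreach (var p in processes)
                p.WaitForExit();
        }
    }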

Phil

Okay, so I found the answer to "Trying to run multiple HTTP requests in parallel, but being limited by Windows (registry)": it's called a "thread pool". I finally decided to download the XML files directly and then parse the documents offline, instead of parsing the website directly to get an SQL format. The new method works; I can download and write up to 10,000 XML files in only 9 seconds. I tried to push it to 150k (all the website's pages) but now I have a strange bug: I'm getting duplicate items... I'm going to try to rewrite the full code using the correct method for pools, multi task/thread, dictionary and IEnumerable containers. Fingers crossed it works on 150k items without losing data in the process, and I'll post back the full code.
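
If it helps anyone attempting the same rewrite: the per-host connection cap is usually lifted via `ServicePointManager.DefaultConnectionLimit`, and a dictionary keyed by item ID makes duplicates impossible. A sketch under those assumptions (the limit of 100 and the ID range are arbitrary examples):

    using System;
    using System.Collections.Concurrent;
    using System.Linq;
    using System.Net;
    using System.Threading.Tasks;

    public static class Downloader
    {
        public static void Main()
        {
            // .NET defaults to only 2 concurrent connections per host;
            // raising the limit is what removes that ceiling.
            ServicePointManager.DefaultConnectionLimit = 100;

            // TryAdd refuses a second entry for the same key,
            // so each item ID can only be stored once.
            var pages = new ConcurrentDictionary<int, string>();

            Parallel.ForEach(Enumerable.Range(1, 10000), id =>
            {
                using (var client = new WebClient())
                {
                    pages.TryAdd(id, client.DownloadString(
                        "http://fr.wowhead.com/item=" + id + "?xml"));
                }
            });

            Console.WriteLine("Downloaded " + pages.Count + " pages.");
        }
    }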

Phil