
Hi everyone. I want to parse a 300+ MB text file with 2,000,000+ lines in it and perform some operations on the stored data (split every line, make comparisons, save data in a dictionary). It takes the program about 50+ minutes to produce the expected result (for files with 80,000 lines it takes about 15-20 seconds). Is there any way to make it work faster? Code sample below:

    using (FileStream cut_file = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
    using (BufferedStream bs = new BufferedStream(cut_file))
    using (StreamReader s_reader = new StreamReader(bs)) {
        string line;
        while ((line = s_reader.ReadLine()) != null) {
            string[] every_item = line.Split('|'); // line sample: jdsga237 | 3332, 3223, 121 |
            string car = every_item[0];
            string[] cameras = every_item[1].Split(',');
            if (!cars.Contains(car)) { // cars is a List<string> defined at the beginning of the program
                for (int camera = 0; camera < cameras.Count(); camera++) {
                    if (cams_input.Contains(cameras[camera])) { // cams_input is a List<string> defined at the beginning of the program
                        cars.Add(car);
                        result[myfile]++; // result is a Dictionary<string, int>; a dictionary is used because several files are parsed
                    }
                }
            }
        }
    }
DocC
  • I think you have a memory issue. Open Task Manager while the program is running and watch memory usage. If you don't have enough memory on the computer, the data is paged out to swap space on a hard drive, which will significantly slow down the application. Try running on a computer with more memory. – jdweng Mar 13 '16 at 12:35
  • Two suggestions: you could try writing it async, or, instead of reading line by line, try solving it via regex. – Mostafa Mar 13 '16 at 12:39
  • You should also consider some one-pass parsing, because String.Split is very inefficient when used multiple times over basically the same string. Also, you can use `HashSet` for `cams_input` and `cars`. – Eugene Podskal Mar 13 '16 at 12:40
  • Very much related - http://stackoverflow.com/questions/7153315/how-to-parse-a-text-file-in-c-sharp-and-be-io-bound – Eugene Podskal Mar 13 '16 at 12:50
  • And this - http://stackoverflow.com/questions/8037070/whats-the-fastest-way-to-read-a-text-file-line-by-line – Eugene Podskal Mar 13 '16 at 12:52
  • You are getting a lot of bad answers if you really have a memory issue. The suggestions use more memory than your current code. You are simply adding items. Using a hash will only give improvements if you are trying to query the data, but uses more memory when adding. Right now you have no queries. – jdweng Mar 13 '16 at 13:08
  • @jdweng While you are absolutely right that memory can be an issue, I'd like to point out that a "text-file with 2.000.000+ lines", where each line is a car, doesn't bode well for the performance of `cars.Contains(car); ... cars.Add(car);` where "cars is a List". – Eugene Podskal Mar 13 '16 at 14:04
  • Let's not put the cart before the horse. Let's get the immediate issue solved first: why the program is running slow. Adding hashes will just use more memory and slow the application down further. Let's find out if the issue is really memory or something else. – jdweng Mar 13 '16 at 15:09
  • @jdweng, well, memory usage is ~13-16 MB (CPU ~9-11%). I guess it isn't a memory issue, is it? I tried using HashSets instead of Lists and now the elapsed time is about 1-2 seconds (thanks to Eugene Podskal and A. Chiesa), but I feel that I can't duplicate items in a HashSet - am I right? – DocC Mar 13 '16 at 16:13
  • Exactly, you cannot duplicate items in a HashSet. It's the whole point of the HashSet: to have one single item in the set for each value (see the sketch after this comment thread). – Alberto Chiesa Mar 13 '16 at 18:00
  • Not necessarily. A hash will look at only one field as the primary index, and in many cases a primary key can have more than one row that isn't a duplicate. – jdweng Mar 13 '16 at 20:20
  • @jdweng please don't give out-of-context information. In one HashSet you cannot have duplicate items. Stop. Then, if you want to discuss implementation details and how a hash could be shared among more than one item, feel free to do it. But please don't generate confusion. – Alberto Chiesa Mar 13 '16 at 20:27
  • A hash can be a Dictionary<key, List<value>>, which allows duplicates, just not duplicate keys. – jdweng Mar 14 '16 at 09:10
  • I guess the speed issue is with your add method. I assumed the add was the standard List<T>.Add method, but looking at the code I can't tell. – jdweng Mar 14 '16 at 09:15
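
To illustrate the point settled in the comments above, here is a minimal sketch (the class and item names are illustrative, not from the question): `HashSet<T>.Add` returns `false` when an equal item is already present, so a set can never hold duplicates.

    using System;
    using System.Collections.Generic;

    class HashSetDuplicateDemo
    {
        static void Main()
        {
            var cars = new HashSet<string>();
            Console.WriteLine(cars.Add("jdsga237")); // True: the item was inserted
            Console.WriteLine(cars.Add("jdsga237")); // False: an equal item is already present
            Console.WriteLine(cars.Count);           // 1: the set keeps a single copy
        }
    }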

1 Answer


Well, it is quite possible you have a problem linked to memory use. However, your code also has some blatant inefficiencies: when you call Contains() on a List, you basically do a foreach over the whole List, once per line.

So, an improvement over your code is to use a HashSet instead of a List in order to speed up the Contains() calls.
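
As a minimal sketch of the difference (the sizes and names here are illustrative, not taken from the question): `List<T>.Contains` scans the elements one by one, so each lookup is O(n), while `HashSet<T>.Contains` hashes the item and is O(1) on average.

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.Linq;

    class ContainsComparison
    {
        static void Main()
        {
            // Illustrative size; the real input has 2,000,000+ lines.
            var items = Enumerable.Range(0, 100000).Select(i => "car" + i).ToArray();
            var list = new List<string>(items);
            var set = new HashSet<string>(items);

            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 1000; i++) list.Contains("car99999"); // O(n): scans the whole list
            Console.WriteLine("List.Contains:    " + sw.ElapsedMilliseconds + " ms");

            sw.Restart();
            for (int i = 0; i < 1000; i++) set.Contains("car99999");  // O(1): hash lookup
            Console.WriteLine("HashSet.Contains: " + sw.ElapsedMilliseconds + " ms");
        }
    }

With two million lines, that per-lookup difference is exactly what turns 50 minutes into a couple of seconds.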

The same goes for the useless Linq usage: calling the Count() extension on the array in the for loop. It's an array, so just use its Length property.

Anyway, you should profile the code on your machine (I use the JetBrains profiler and find it invaluable for this kind of performance investigation).

My take on this:

    string myfile = "";                         // path of the file being parsed
    var cars = new HashSet<string>();           // HashSet: O(1) Contains, unlike List
    var cams_input = new HashSet<string>();     // populated elsewhere with the cameras of interest
    var result = new Dictionary<string, int>();
    foreach (var line in System.IO.File.ReadLines(myfile, System.Text.Encoding.UTF8))
    {
        var everyItem = line.Split('|'); // line sample: jdsga237 | 3332, 3223, 121 |
        var car = everyItem[0];
        if (cars.Contains(car)) continue; // skip cars that were already counted

        var cameras = everyItem[1].Split(',');

        for (int camera = 0; camera < cameras.Length; camera++)
        {
            if (cams_input.Contains(cameras[camera]))
            {
                cars.Add(car);
                // I really don't get who is inserting value zero.
                result[myfile]++;
            }
        }
    }
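
About the "value zero" comment in the code above: `result[myfile]++` throws a KeyNotFoundException if the key was never initialized. A hedged sketch of one way to make the increment safe (the `Increment` helper is hypothetical, not part of the original code):

    // Hypothetical helper: increments a counter, treating a missing key as zero,
    // so the first increment cannot throw.
    static void Increment(Dictionary<string, int> counts, string key)
    {
        int current;
        counts.TryGetValue(key, out current); // current is 0 when the key is absent
        counts[key] = current + 1;
    }

Calling `Increment(result, myfile)` instead of `result[myfile]++` removes the need for anyone to insert the initial zero.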

Edit: As per your comment, the performance issue was indeed related to the use of lists. You should read a guide to the collections available in the .Net framework, like this one: http://www.codethinked.com/an-overview-of-system_collections_generic

Every type is best suited to a kind of task: the HashSet, for example, is meant to store a set (doh!) of unique values, and the really shiny feature it gives you is O(1) Contains operations. What you pay is the storage and the computation of the hashes; you also lose ordering, etc.

A Dictionary is basically a HashSet with a value attached to each key.
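
A small sketch of that parallel (names are illustrative): the keys of a Dictionary are hashed and unique, exactly like HashSet items, with a value riding along on each one.

    using System;
    using System.Collections.Generic;

    class KeysAreASet
    {
        static void Main()
        {
            var result = new Dictionary<string, int>();
            result["file1.txt"] = 3;

            // Key lookup is a hashed, O(1) operation, like HashSet.Contains:
            Console.WriteLine(result.ContainsKey("file1.txt")); // True

            // The indexer overwrites an existing key; Add would throw instead:
            result["file1.txt"] = 4;       // fine
            // result.Add("file1.txt", 5); // ArgumentException: duplicate key
        }
    }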

Good study!

PS: if the problem is solved, please close the question.

Alberto Chiesa
  • Thank you. Now it works faster. I didn't know about HashSet (I am a C# beginner). Are there any pitfalls of HashSet that I should know about? – DocC Mar 13 '16 at 16:27