
Is it possible to load a file with 3 or 4 million lines in less than 1 second (1.000000)? Each line contains one word. Words range in length from 1 to 17 characters (does that matter?).

My code is now:

List<string> LoadDictionary(string filename)
{
    List<string> wordsDictionary = new List<string>();

    Encoding enc = Encoding.GetEncoding(1250);//I need ę ą ć ł etc.
    using (StreamReader r = new StreamReader(filename, enc))
    {
        string line = "";
        while ((line = r.ReadLine()) != null)
        {
            if (line.Length > 2)
            {
                wordsDictionary.Add(line);
            }
        }
    }

    return wordsDictionary;
}

Results of timed execution:

(screenshot: time of loading 4 million words)

How can I make this method execute in half the time?

deadfish (question edited by Wayne Koorts)
  • Do you have access to a profiler? Without knowing where the bottleneck is (disk i/o, function call overhead, memory management, etc.) it's hard to know how to speed it up. – Adrian McCarthy Nov 09 '11 at 00:12
  • 2
    You could write variations to see where the bottleneck is. Can you simply read the file in under a second without adding the words? Can you add three or four million words to the dictionary without reading them from a file? – Adrian McCarthy Nov 09 '11 at 00:14
  • @AdrianMcCarthy with a declared capacity of `new List( 3500000 )` (three and a half million), the whole process takes about 1 second. See the answer from @brian519. Without `wordsDictionary.Add(line);` the time is `0.4848208`. – deadfish Nov 09 '11 at 17:07

4 Answers


If you know that your list will be large, you should set a good starting capacity.

List<string> wordsDictionary = new List<string>( 100000 );

If you don't do this, the list will need to keep increasing its capacity as it fills, which takes a bit of time. This likely won't cut the load time in half, but it's a start.
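For illustration only (not from the original answer), this is roughly what the question's method looks like with a pre-sized list. The capacity of 4,000,000 is just an assumption based on the stated upper bound of 4 million lines:

List<string> LoadDictionary(string filename)
{
    // Pre-size the list to the expected word count so it never has to grow and copy.
    List<string> wordsDictionary = new List<string>(4000000);

    Encoding enc = Encoding.GetEncoding(1250); // I need ę ą ć ł etc.
    using (StreamReader r = new StreamReader(filename, enc))
    {
        string line;
        while ((line = r.ReadLine()) != null)
        {
            if (line.Length > 2)
            {
                wordsDictionary.Add(line);
            }
        }
    }

    return wordsDictionary;
}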

brian519
  • +1: Since he knows that his range will be three to four million, he should probably give it a starting capacity in the millions. – StriplingWarrior Nov 08 '11 at 23:28
  • FYI: I just tried this on some sample data, and the effect was negligible. I/O is really the slow point here. – StriplingWarrior Nov 09 '11 at 00:01
  • This is a good guess, but without profiling, it's hard to know if that's the bottleneck. – Adrian McCarthy Nov 09 '11 at 00:13
  • @brian519 [after declaring the size as three and a half million - picture](http://i.stack.imgur.com/XVUbf.png) It now gives 1.0160538 :) Nice, I didn't realize it could make it faster. – deadfish Nov 09 '11 at 16:40
  • Good point. I did some experimentation: on my machine roughly 1/3 of the time is spent reading data out of the file, and 2/3 is spent setting values in the array. The CPU barely flinches. So RAM evidently plays a bigger part than I thought. – StriplingWarrior Nov 09 '11 at 17:00

How do File.ReadAllLines() and some LINQ perform?

public List<string> LoadDictionary(string filename)
{
    // Requires using System.Linq; for the Where() call.
    List<string> wordsDictionary = new List<string>();
    Encoding enc = Encoding.GetEncoding(1250);
    string[] lines = File.ReadAllLines(filename, enc);
    wordsDictionary.AddRange(lines.Where(x => x.Length > 2));
    return wordsDictionary;
}
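Not part of the original answer, but as a sketch of a possible variation: `File.ReadLines` (available in .NET 4) streams the lines lazily instead of materializing the whole `string[]` first, which avoids one large intermediate allocation:

public List<string> LoadDictionary(string filename)
{
    // Requires using System.Linq; File.ReadLines yields lines as they are read,
    // so only the final list is allocated rather than an intermediate array.
    Encoding enc = Encoding.GetEncoding(1250);
    return File.ReadLines(filename, enc)
               .Where(x => x.Length > 2)
               .ToList();
}

Whether this actually beats ReadAllLines depends on where the time really goes, which loops back to the profiling advice further down.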
p.campbell
  • it returns `1.1970612`; it's slower than mine. But +1 for LINQ :) – deadfish Nov 09 '11 at 16:34
  • @Cooldown4seconds how much does it change on the 2nd, 3rd and nth run? i.e. I think we're dealing with the disk being the bottleneck. Do you have an SSD that you can run this on, for example? – p.campbell Nov 09 '11 at 16:41
  • In my testing, `File.ReadAllLines` takes about 90% as long all by itself as the original method. Once you add the AddRange, it's pretty much the same. Good suggestion, though: this is much cleaner. – StriplingWarrior Nov 09 '11 at 16:53
  • @p.campbell I'm running it on `Intel Core 2 Quad CPU Q6600 @ 2.40GHz 2.40GHz, Installed memory (RAM) 8GB, System type x64 W7`. The next results are: `1.1940683` `1.1814641` `1.1924561`. However, this application needs to work on the school's computers, which are slower than mine. They do not have x64. – deadfish Nov 09 '11 at 16:53
  • @Cooldown4seconds OK that's cool. Disk is the bottleneck; CPU and memory, not so much. – p.campbell Nov 09 '11 at 16:57

Your biggest performance hit at this point is probably just from pulling data off the hard drive and into memory. It's unlikely that you can do anything to get it to go much faster, short of getting better hardware.
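One rough way to check this on your own machine (a sketch, not from the answer): time the raw read by itself and compare it against the full method. If the raw read alone already eats most of the one-second budget, no amount of list tuning will help. The file path here is a hypothetical placeholder.

using System;
using System.Diagnostics;
using System.IO;

class RawReadCheck
{
    static void Main()
    {
        string filename = "dictionary.txt"; // hypothetical path to the word list

        var sw = Stopwatch.StartNew();
        byte[] raw = File.ReadAllBytes(filename); // raw I/O only: no decoding, no list building
        sw.Stop();

        Console.WriteLine("Raw read: {0:F4} s for {1} bytes",
            sw.Elapsed.TotalSeconds, raw.Length);
    }
}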

StriplingWarrior

Profile. Profile. Profile.

We can all guess at where the time is spent and then propose other methods that may be faster. Some of us may even have a good intuition or get lucky and stumble on the right answer. But it's going to be much more productive to measure, iterate, and measure again.

Raymond Chen did an interesting series on loading a Chinese/English dictionary and getting the load time down. It's not exactly the same (his version does character conversion and some simple parsing, and the dictionary was a bit smaller), and it's in a different language. But I recommend the series anyway, because it shows the right way to optimize something like this: profile, profile, profile.
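Even without a full profiler, a small Stopwatch harness helps keep the measure-change-measure loop honest. This is only a sketch; the `Func` parameter stands in for whichever LoadDictionary variant from this page is being tested:

// Requires using System; using System.Collections.Generic; using System.Diagnostics;
static void TimeLoad(Func<string, List<string>> load, string filename)
{
    // Several passes: the first one usually pays the cold-cache disk cost.
    for (int run = 1; run <= 5; run++)
    {
        var sw = Stopwatch.StartNew();
        List<string> words = load(filename);
        sw.Stop();
        Console.WriteLine("Run {0}: {1:F4} s, {2} words",
            run, sw.Elapsed.TotalSeconds, words.Count);
    }
}

Called as, for example, `TimeLoad(LoadDictionary, "dictionary.txt")` from wherever the load method lives.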

Adrian McCarthy
  • @Cooldown 4 seconds: A profiler is a tool that watches your program run and tells you which parts of the code are taking the most time. When optimizing, you need to know which parts are slow. If 90% of the time is spent allocating memory for the list, but you work on speeding up the disk access, then you're wasting your effort. Profiling helps you avoid wasting your time by letting you focus on the part that matters. – Adrian McCarthy Nov 11 '11 at 00:20