I have a list of about 7 million strings in a 152MB text file. What would be the best way to implement a function that takes a single string and returns whether it is in that list?
2 Answers
Are you going to have to match against this text file several times? If so, I'd create a HashSet<string>. Otherwise, just read it line by line (I'm assuming there's one string per line) and see whether it matches.
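For the one-off case, a minimal sketch of the line-by-line scan might look like the following. It assumes .NET 4 or later (for File.ReadLines), one string per line, and exact ordinal matching; the class and method names are hypothetical.

```csharp
using System;
using System.IO;
using System.Linq;

static class OneOffLookup
{
    // Hypothetical helper: streams the file and stops at the first exact match.
    public static bool FileContainsLine(string path, string candidate)
    {
        // File.ReadLines enumerates lazily, so the whole file is never held in memory at once.
        return File.ReadLines(path)
                   .Any(line => string.Equals(line, candidate, StringComparison.Ordinal));
    }
}
```

Each call rereads the file from the start, so this only makes sense when lookups are rare.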
152MB of ASCII will end up as over 300MB of Unicode data in memory - but modern machines have plenty of memory, so keeping the whole lot in a HashSet<string> will make repeated lookups very fast indeed.
The absolute simplest way to do this is probably to use File.ReadAllLines, although that will create an array which will then be discarded - not great for memory usage, but probably not too bad:
HashSet<string> strings = new HashSet<string>(File.ReadAllLines("data.txt"));
...
if (strings.Contains(stringToCheck))
{
...
}
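If the temporary array is a concern, one possible variation is to stream the lines straight into the set. This is a sketch that assumes .NET 4 or later, where File.ReadLines is available; the StringSetLoader class and its Load method are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class StringSetLoader
{
    // Hypothetical helper: builds the set without materializing a string[] first.
    public static HashSet<string> Load(string path)
    {
        // File.ReadLines yields one line at a time; the HashSet constructor
        // consumes the sequence directly. Ordinal comparison gives exact,
        // case-sensitive matching.
        return new HashSet<string>(File.ReadLines(path), StringComparer.Ordinal);
    }
}
```

The set itself still holds every string, so this trims only the short-lived array, not the steady-state memory use.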

Jon Skeet
- Actually I have to search again and again. But I am going to use this in a web application. Will memory become an issue with many requests? – Tasawer Khan Apr 19 '10 at 08:42
- @Taz: The number of requests is irrelevant, as long as you build up your HashSet only once :) According to the documentation: *Any public static members of this type are thread safe*, so no problem here, too – tanascius Apr 19 '10 at 08:43
- @Taz: tanascius is right. Load it up once and you should be able to search (using multiple concurrent threads, even - so long as nothing's writing to it) without any extra memory use. So long as your web server has enough memory to hold the set, that's the way to go. – Jon Skeet Apr 19 '10 at 08:45
- What would you recommend for huge files of 2GB+ in size? Load partial data at a time? – Nayan Apr 19 '10 at 09:03
- @Nayan: Use a proper database! – Jon Skeet Apr 19 '10 at 09:12
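The "load it up once" advice from the comments could be sketched roughly as below for a web application. This assumes .NET 4 or later (for Lazy<T>); the StringLookup class and the "data.txt" path are placeholders, and the set must never be modified after loading, since concurrent reads are only safe while nothing writes to it.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class StringLookup
{
    // Hypothetical location of the word list; adjust for the real deployment.
    private const string DataPath = "data.txt";

    // Lazy<T> defaults to thread-safe initialization, so the file is read
    // at most once even if several requests arrive at the same time.
    private static readonly Lazy<HashSet<string>> Strings =
        new Lazy<HashSet<string>>(() => new HashSet<string>(File.ReadAllLines(DataPath)));

    // Read-only lookup shared by all requests; nothing here mutates the set.
    public static bool Contains(string candidate)
    {
        return Strings.Value.Contains(candidate);
    }
}
```

Every request then calls StringLookup.Contains(stringToCheck), and memory use stays at one copy of the set regardless of request volume.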