
The text file has 10,000 lines. I've tried using File.ReadLines and StreamReader, but it seems pretty slow. Here's my code:

foreach (var line in File.ReadLines(ofd.FileName))
{
    if (analysisDatas.All(analysisData => !string.Equals(analysisData.Text, line, StringComparison.CurrentCultureIgnoreCase)))
    {
        var item = new AnalysisData { Text = line };
        analysisDatas.Add(item);
    }
}

Is there a more efficient way to add them into my list of objects?

Erik Philips
userrrrrrr
  • What about using [File.ReadAllLines](https://msdn.microsoft.com/en-us/library/system.io.file.readalllines(v=vs.110).aspx)? – Erik Philips Mar 19 '18 at 20:35
  • 1
    "seems pretty slow". How slow is pretty slow? 2s? 10s? 2 minutes? An hour? – Joel Coehoorn Mar 19 '18 at 20:35
  • 1
    @ErikPhilips That's likely to make things much worse, because it will try to load the entire file contents into memory before you can use any of it, and it will keep the entire file contents in memory until you've finished with all of it. – Joel Coehoorn Mar 19 '18 at 20:35
  • @JoelCoehoorn about 30minutes in 4gb RAM – userrrrrrr Mar 19 '18 at 20:36
  • @JoelCoehoorn That depends on a number of variables we have not been told (where the data is, local/remote etc). – Erik Philips Mar 19 '18 at 20:36
  • Instead of a list of objects, how about a HashSet, and not checking if it's already in your list already? – C. Helling Mar 19 '18 at 20:36
  • @ErikPhilips it won't work https://stackoverflow.com/a/21970035/6709247 – userrrrrrr Mar 19 '18 at 20:38
  • @userrrrrrr Actually works great on a SSD Array. Again it depends on variables you haven't given. – Erik Philips Mar 19 '18 at 20:39
  • @C.Helling In this case i need an indexer. is it good if converting back from hashset into the list? – userrrrrrr Mar 19 '18 at 20:39
  • @userrrrrrr I think there too many details missing to answer. Why do you need an indexer? Do you need the order it appears in the file? I just wanted to point out how inefficient it would be to check `analysisDatas.All` for a line of text 10000 times. – C. Helling Mar 19 '18 at 20:43
  • @ErikPhilips It's not about SSD vs HDD. ReadAllLines() is almost certainly causing an OutOfMemoryException. – Joel Coehoorn Mar 19 '18 at 20:45

2 Answers


You're iterating your entire new collection (with .All) on every pass of the loop, which makes the work quadratic and leads to some pretty nasty slow-down as the number of items increases.

Here's one way that might show better performance characteristics:

analysisDatas = File
    .ReadLines(filePath)
    .Distinct(StringComparer.CurrentCultureIgnoreCase)
    .Select(line => new AnalysisData { Text = line })
    .ToList();
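To illustrate what the comparer does here, a small self-contained sketch (the sample lines are hypothetical stand-ins for the file's contents):

```csharp
using System;
using System.Linq;

// Hypothetical sample standing in for the file's lines.
var lines = new[] { "alpha", "Alpha", "beta", "ALPHA", "gamma" };

// Distinct with a case-insensitive comparer keeps one entry per
// case-insensitively distinct line (the first occurrence, in practice).
var unique = lines
    .Distinct(StringComparer.CurrentCultureIgnoreCase)
    .ToList();

Console.WriteLine(string.Join(", ", unique));
```

Because the whole pipeline is lazy until .ToList(), the file is still streamed one line at a time, unlike File.ReadAllLines.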
spender

If you can get a good key for each line, I suggest using a HashSet<T> rather than All() to check each line. A simple/naive example might look like this:

var lineKeys = new HashSet<int>();
foreach (var line in File.ReadLines(ofd.FileName))
{
    int hash = line.ToUpper().GetHashCode();
    if (lineKeys.Add(hash) || analysisDatas.All(analysisData => !string.Equals(analysisData.Text, line, StringComparison.CurrentCultureIgnoreCase)))
    {
        var item = new AnalysisData { Text = line };
        analysisDatas.Add(item);
    }
}

Note I said, "if". Comparing via hash code after ToUpper() is not exactly the same as StringComparison.CurrentCultureIgnoreCase. Some cultures have characters that need special handling for accents and the like. This might be a problem in your situation, but it might not; you'll have to look at your data and evaluate your needs. Don't shortchange yourself on that evaluation.

Also note my use of int for the HashSet. I could just put the strings there. However, then we'd end up storing two sets of data in memory for each line: the original line string in the analysisDatas collection, and the upper-case string in the HashSet. Even if comparisons in the HashSet were only done via hash code values, the full string would still be stored. Keeping only the int lets the GC collect the upper-case versions of the strings. Since there have already been OutOfMemoryException issues, I opted to take a hit on potential wrong matches (hash collisions) in order to save memory.
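As the comment thread below discusses, an alternative is to store the strings themselves in a HashSet<string> built with a culture-aware comparer, so the set handles collisions with a real equality check instead of trusting the hash alone. A minimal, self-contained sketch (the sample lines and the AnalysisData stub are stand-ins for the question's file and class):

```csharp
using System;
using System.Collections.Generic;

// Stand-in sample for File.ReadLines(ofd.FileName).
var lines = new[] { "foo", "FOO", "bar", "Foo", "baz" };

var analysisDatas = new List<AnalysisData>();

// The set does both the hashing and the full equality check on
// collision, using a culture-aware case-insensitive comparer.
var seen = new HashSet<string>(StringComparer.CurrentCultureIgnoreCase);

foreach (var line in lines)
{
    // Add returns false when an equal line was already seen.
    if (seen.Add(line))
        analysisDatas.Add(new AnalysisData { Text = line });
}

Console.WriteLine(analysisDatas.Count);

// Minimal stand-in for the question's AnalysisData class.
class AnalysisData { public string Text { get; set; } }
```

Per the comments, storing the string reference costs little more than storing an int, since the set holds a reference to the existing string rather than a copy; the real memory cost is only whatever Distinct-style dedup already requires.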

Joel Coehoorn
  • 1
    Why in the world are you storing the hash in the database, instead of the item itself? That just breaks the code for any collisions. – Servy Mar 19 '18 at 20:48
  • What would be the best hashing implementation for the `line` string? – Akash KC Mar 19 '18 at 20:49
  • 1
    @Servy I went back and forth on that, and based on the info that he's already had out-of-memory issues, I opted against storing the full string in the HashSet. But yes, because of that, it might be wise to also add the full check back that only needs to run in cases where `Add()` returns false. – Joel Coehoorn Mar 19 '18 at 20:51
  • @JoelCoehoorn Why in the world would you re-implement the behavior that `HashSet` already provides and that *you removed yourself*. If you want the hashes to be checked, and then the values themselves checked after any hash matches *you just use a regular hashset, because that's how hashes work*. – Servy Mar 19 '18 at 20:53
  • @userrrrrrr This solution is broken. It's removing lines that don't match other lines, because it's making the false assumption that the hash code for each string is unique, which can't be true. – Servy Mar 19 '18 at 20:55
  • @Servy Added the original `All()` check back as a safety, but in a way where it doesn't need to run every time. – Joel Coehoorn Mar 19 '18 at 20:55
  • @JoelCoehoorn Again, why are you going out of your way to *specifically* remove the functionality that `HashSet` provides, only to then try and add it back in through a much slower, inefficient, and unnecessary means? Just *use an actual hashset* when you want to have a set of items, and let the type do the job it was designed to do. – Servy Mar 19 '18 at 20:58
  • @Servy because the OP mentioned this is using 4GB of RAM. That's dangerously close to OOM territory. The current version of the question is _safe_ (it does the full check when needed) and optimized relative to the original question (it only does the full check _when needed_, instead of for every line). – Joel Coehoorn Mar 19 '18 at 21:04
  • 2
    @JoelCoehoorn Storing the actual string references would take *no more space than storing the hash*. Your solution saves nothing on space, but is much more work to maintain, unnecessarily, doesn't handle collisions, and as you yourself note, doesn't work for lots of different character sets. All because you are opposed to storing the string in hashset for...what reason? Again, this is all stuff that `HashSet` does for you (although more effectively). You're not *avoiding* anything, you're just doing the work yourself (but not as well). – Servy Mar 19 '18 at 21:07