How to count occurences of a char in a big (5+ GBs) file using C#?

Question

To provide some context, I'm trying to optimize the below code which reads a file line by line, buffers those lines and saves to the database every 100 lines -

using (StreamReader sr = new StreamReader(fileName, Encoding.Default)) 
{
    IList<string> list = new List<string>();
    int lineCount = 0;
    foreach (var line in sr.ReadLines((char)someEOL)) //ReadLines is an extension method that yield returns lines based on someEOL while reading character by character 
    {
        list.Add(line); //Keeping it simple for this example. In the actual code it goes through a bunch of operations
        if(++lineCount % 100 == 0) { //Will not work if the total number of lines is not a multiple of 100
            SaveToDB(list);
            list = new List<string>();
        }     
    }
    if(list.Count() > 0)
        SaveToDB(list); //I would like to get rid of this. This is for the case when total number of lines is not a multiple of 100.   
}

As you would notice, SaveToDB(list) happens twice in the above code. It is needed the second time in case total number of lines % 100 != 0 (for example, if there are 101 lines, the if(lineCount % 100 == 0) will miss the last one). It's not a huge bother but I'm wondering if I can get rid of it.

To that end, if I could read the total number of lines before getting in the foreach loop, I could write if(lineCount % 100 == 0) differently. But finding the total number of lines requires going through the file character by character to count someEOL which is a definite no because the file size can range from 5-20 GBs. Is there a way to do the count without a performance penalty (which seems doubtful to me but maybe there is a solution)? Or another way to rewrite it to get rid of that extra SaveDB(list) call?

Very confusing what you have problem with - refactoring code can be done with LINQ batching (http://stackoverflow.com/questions/13731796/create-batches-in-linq), but title asks for counting character - so batching can't be answer to your question... — Alexei Levenkov, Feb 05 '17 at 04:19
@AlexeiLevenkov But how would LINQ be able to create batches in this scenario? — Achilles, Feb 05 '17 at 04:23
The same way as with any other enumerable... Not really sure what you are confused about? — Alexei Levenkov, Feb 05 '17 at 04:24
Ok I might be missing something. My understanding is that it would need to know the total number of items in advance to be able to do the count for each batch (and specially the last batch). Is it true? — Achilles, Feb 05 '17 at 04:27
Sorry, I read it just now. It is basically the same logic as mine. Both have that if condition at the end to get the items that represent the totalitems % 100 != 0 problem. — Achilles, Feb 05 '17 at 04:40
Your link does hint at at a refactoring idea that the buffering can potential be moved to ReadLines instead (which is probably more suitable/logical place for it). — Achilles, Feb 05 '17 at 04:45

Ghasan غسان · Accepted Answer · 2017-02-05T04:35:57.810

2

Your code looks fine, except for creating new empty lists each time a 100-line is read. Anyways, you might want to try this approach:

var enumerator = sr.ReadLines((char)someEOL).GetEnumerator();
isValid = true;

for (int i = 1; isValid; i++)
{
    bool isValid = enumerator.MoveNext();

    if (isValid)
    {
        list.Add(enumerator.Current);
    }

    if (i % 100 ==  0 || (!isValid && list.Count() > 0))
    {
        SaveToDB(list);

        // It is better to clear the list than creating new one for each iteration, given that your file is big.
        list.Clear();
    }
}

edited Feb 05 '17 at 04:35

answered Feb 05 '17 at 04:31

Ghasan غسان

5,577
4
33
44

Can you please clarify how this code solves refactoring to avoid second call to `SaveToDB`? – Alexei Levenkov Feb 05 '17 at 04:32
I updated my answer. I am not sure why you are uneasy about calling it two times, it is perfectly normal is such situations. – Ghasan غسان Feb 05 '17 at 04:36
@GhasanAl-Sakkaf Ok, if that's the case. I've been just looking for a confirmation that there's no other clever logic that can resolve it better. – Achilles Feb 05 '17 at 04:42
You can use the above code to call it one time, but I personally prefer a more readable code. I think your code is really fine. – Ghasan غسان Feb 05 '17 at 04:44

score 1 · Answer 2 · answered Feb 05 '17 at 08:12

I think you're looking for StreamReader.Peek()

sr.Peek().Equals(-1)

Code:

        string filepath = "myfile.txt";
        int lineCount = 0;
        List<string> list = new List<string>();
        using (StreamReader sr = File.OpenText(filepath))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                lineCount++;
                if (lineCount % 100 == 0 || sr.Peek().Equals(-1))
                { 
                    SaveToDB(list);
                    list = new List<string>();
                }
            }
        }

How to count occurences of a char in a big (5+ GBs) file using C#?

2 Answers2