
I have a txt file. Right now I need to load it line by line and check how many times '@' appears in the entire file.

So, basically, I have a single-line string; how do I get the number of occurrences of '@' fast?

I need to count this fast since we have lots of files like this and each of them is about 300-400MB.

I searched, and it seems the straightforward way is the fastest way to do this:

int num = 0;
foreach (char c in line)
{
    if (c == '@') num++;
}

Is there a different method that could be faster than this? Any other suggestions?

  • if needed, we do not have to load the txt file line by line, but we do need to know the number of lines in each file.

Thanks

urlreader
  • I'd reopen this because the duplicate is about strings (another matter) and not specifically asking for a **fast** search for characters – Adriano Repetti Jun 14 '17 at 16:34
  • 1
    No need to go line by line, load it all into memory. You are looking at it char-by-char which is likely the fastest way to do it so you can also look for and count \n to total the lines within the same loop. – Alex K. Jun 14 '17 at 16:34
  • If you want speed, foreach seems the fastest; consider @AlexK.'s suggestion. Or you could use a `for` loop instead and cache the length of your string, that is: `for(int i = 0, l = yourString.Length; i < l; i++)` – xGeo Jun 14 '17 at 16:41
  • Loading an entire 400MB file into memory is not exactly memory-friendly, but it is certainly faster than the line-by-line (or block-by-block) counterpart. – Federico Dipuma Jun 14 '17 at 16:49
  • 1
    Memory is there to be used, imo the only reason not to load that much data (for a quick operation) is if you can reliably anticipate it will cause problems. – Alex K. Jun 14 '17 at 16:55
  • If you load it into memory and then immediately discard it, it makes little difference (as far as memory usage goes) whether you did it in a single chunk or one line at a time. However, you gain in speed by avoiding the extra overhead of processing each line. If you were going to need the contents of the file again, you should use a memory-mapped file. – Cody Gray - on strike Jun 14 '17 at 17:19
  • Since you're looking for a single character, there *are* ways of parallelizing this search, processing "chunks" of characters at a time and only stopping to process each character within that "chunk" if it contains a potential match. The running time is still *O(n)*, but you've decreased *n* slightly. This is probably not worth it in this case, though, since the overhead of the required file I/O will dwarf any speed-up you'd see from this parallelization. It would also make the code significantly more complex. – Cody Gray - on strike Jun 14 '17 at 17:21
  • It's all (and only) about the I/O. Your simple for loop is more than good enough. What matters is how fast you can get those `line`s into memory. – H H Jun 14 '17 at 17:27
  • I can go ahead and tell you right now that foreach is the exact opposite of fast. – Krythic Jun 14 '17 at 17:58
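
A minimal sketch of Alex K.'s single-pass idea above, reading in fixed-size chunks so the whole file never has to sit in memory at once; the path and chunk size are placeholders, and the line total assumes '\n' line breaks:

long atCount = 0;
long lineBreaks = 0;
char[] buffer = new char[500000];                          // placeholder chunk size

using (var reader = new StreamReader("C:\\tmp\\test.txt")) // placeholder path
{
    int read;
    while ((read = reader.ReadBlock(buffer, 0, buffer.Length)) > 0)
    {
        for (int i = 0; i < read; i++)
        {
            char c = buffer[i];
            if (c == '@')
                atCount++;
            else if (c == '\n')
                lineBreaks++;                              // works for both \n and \r\n endings
        }
    }
}

// If the last line has no trailing '\n', the file has lineBreaks + 1 lines.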

3 Answers


The fastest approach is really bound to I/O capabilities and computational speed. Usually the best way to understand which technique is fastest is to benchmark them.

Disclaimer: results are (of course) bound to my machine and may vary significantly on different hardware. For testing I used a single text file of about 400 MB in size. If interested, the file may be downloaded here (zipped). The executable was compiled as x86.

Option 1: Read entire file, no parallelization

long count = 0;

var text = File.ReadAllText("C:\\tmp\\test.txt");
for (var i = 0; i < text.Length; i++)
{
    if (text[i] == '@')
        count++;
}

Results:

  • Average execution time: 5828 ms
  • Average process memory: 1674 MB

This is the "naive" approach, which reads the entire file in memory and then uses a for loop (which is significantly faster than foreach or LINQ).

As expected process occupied memory is very high (about 4 times the file size), this may be caused by a combination of string size in memory (more info here) and string processing overhead.

Option 2: Read file in chunks, no parallelization

long count = 0;
using (var file = File.OpenRead("C:\\tmp\\test.txt"))
using (var reader = new StreamReader(file))
{
    const int size = 500000; // chunk size: 500k chars
    char[] buffer = new char[size];

    while (!reader.EndOfStream)
    {
        var read = await reader.ReadBlockAsync(buffer, 0, size); // read one chunk

        for (var i = 0; i < read; i++)
        {
            if (buffer[i] == '@')
                count++;
        }
    }
}

Results:

  • Average execution time: 4819 ms
  • Average process memory: 7.48 MB

This was unexpected. In this version we are reading the file in chunks of 500k characters instead of loading it entirely into memory, and execution time is even lower than with the previous approach. Please note that reducing the chunk size will increase execution time (because of the overhead). Memory consumption is extremely low: as expected, we are only holding about 1 MB at a time (500,000 chars at 2 bytes each) directly in a char array.

Better (or worse) performance may be obtained by changing the chunk size.

Option 3: Read file in chunks, with parallelization

long count = 0;
using (var file = File.OpenRead("C:\\tmp\\test.txt"))
using (var reader = new StreamReader(file))
{
    const int size = 2000000;      // roughly 4 times the single-threaded chunk size
    const int parallelization = 4; // each chunk is split into sub-chunks processed in parallel
    char[] buffer = new char[size];

    while (!reader.EndOfStream)
    {
        var read = await reader.ReadBlockAsync(buffer, 0, size);

        var sliceSize = read / parallelization;
        var counts = new long[parallelization];

        Parallel.For(0, parallelization, i =>
        {
            var start = i * sliceSize;
            var end = start + sliceSize;

            if (i == parallelization - 1)  // the last slice also takes the remainder
                end += read % parallelization;

            long localCount = 0;
            for (var j = start; j < end; j++)
            {
                if (buffer[j] == '@')
                    localCount++;
            }
            counts[i] = localCount;
        });

        count += counts.Sum();
    }
}

Results:

  • Average execution time: 3363 ms
  • Average process memory: 10.37 MB

As expected, this version performs better than the single-threaded one, but not 4 times better, as one might have expected. Memory consumption is again very low compared to the first version (same considerations as before), and we are taking advantage of multi-core environments.

Parameters like chunk size and the number of parallel tasks may significantly change the results; you should just go by trial and error to find the best combination for you.

Conclusions

I was inclined to think that the "load everything in memory" version was the fastest, but this really depends on the overhead of string processing and I/O speed. The parallel chunked approach seems the fastest on my machine, and this should lead you to one conclusion: when in doubt, just benchmark it.
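
For completeness, a minimal sketch of a timing harness in the spirit of the methodology described in the comments below (several runs, the first one discarded, the rest averaged). It assumes `using System.Diagnostics;` and `using System.Linq;`, and `CountOccurrences` is a hypothetical stand-in for whichever of the options above you want to time:

// CountOccurrences is hypothetical: plug in whichever option you want to measure.
Func<long> countAt = () => CountOccurrences("C:\\tmp\\test.txt");
const int runs = 11;
var timings = new double[runs];

for (int i = 0; i < runs; i++)
{
    var sw = Stopwatch.StartNew();
    countAt();                                 // run the routine under test
    sw.Stop();
    timings[i] = sw.Elapsed.TotalMilliseconds;
}

// Discard the first (cold cache / JIT) run and average the rest.
var averageMs = timings.Skip(1).Average();
Console.WriteLine($"Average: {averageMs:F0} ms");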

Federico Dipuma
  • _"used a single text file of ..."_ could of course skew the results due to caching. How did you get your "Average execution time" ? At the least you should repeat the tests and discard the first result. – H H Jun 15 '17 at 13:11
  • @HenkHolterman That's what I did. Compiled an executable for each version which repeats the operation 11 times, discarded the first: `averageTime = sum/10`. – Federico Dipuma Jun 15 '17 at 13:31

You can test if it's faster, but a shorter way to write it would be:

int num = File.ReadAllText(filePath).Count(i => i == '@');

Hmm, but I just saw that you need the line count as well, so this is similar. Again, it would need to be compared to what you have:

var fileLines = File.ReadAllLines(filePath);
var count = fileLines.Length; // Length is a property, not a method
var num = fileLines.Sum(line => line.Count(i => i == '@'));

Rufus L
  • I think this would be the best solution; splitting it into lines at all seems unnecessary, because you could count `\n` characters after you count for `c`. – Loufs Jun 14 '17 at 17:15
  • Why the ToCharArray() ? Doesn't seem necessary. And it probably has to copy all chars, a relatively big cost in time and space. – H H Jun 14 '17 at 17:24
  • 1
    At root other approaches like this are going to end up iterating over the chars, only with additional baggage involved. – Alex K. Jun 14 '17 at 17:32
  • @AlexK. - so far the only answer to mention ReadAllLines(), the key point of this answer. – H H Jun 14 '17 at 18:03

You could use pointers. I don't know if this would be any faster though. You would have to do some testing:

static void Main(string[] args)
{
    string str = "This is @ my st@ing";
    int numberOfCharacters = 0;

    unsafe
    {
        fixed (char *p = str)
        {
            char *ptr = p;
            while (*ptr != '\0')
            {
                if (*ptr == '@')
                    numberOfCharacters++;
                ptr++;
            }
        }
    }

    Console.WriteLine(numberOfCharacters);
}

Note that you must go into your project properties and allow unsafe code in order for this code to work.
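
If you do want to measure the pointer scan against 300-400 MB files without pinning the whole text at once (see the comments below), a minimal sketch that applies the same loop to fixed-size chunks could look like this; the path and chunk size are placeholders:

long count = 0;
char[] buffer = new char[500000];                          // placeholder chunk size

using (var reader = new StreamReader("C:\\tmp\\test.txt")) // placeholder path
{
    int read;
    while ((read = reader.ReadBlock(buffer, 0, buffer.Length)) > 0)
    {
        unsafe
        {
            fixed (char* start = buffer)                   // pins only this chunk, not the whole file
            {
                char* ptr = start;
                char* end = start + read;                  // bound by the chars actually read, not '\0'
                while (ptr < end)
                {
                    if (*ptr == '@')
                        count++;
                    ptr++;
                }
            }
        }
    }
}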

Icemanind
  • 1
    The question is steering you in this (wrong) direction but the only thing that matters is the I/O. – H H Jun 14 '17 at 17:26
  • 1
    Why do you totally ignore (and circumvent) the I/O ? Besides, the `fixed` takes its toll on the GC so I'm not so sure this is going to shine with 400MB of text. Even when using pointers this is still C#, not C. – H H Jun 14 '17 at 18:35
  • @HenkHolterman - His question mentions nothing about I/O problems. His question asked the fastest way to count characters in a string. That was the extent of his question. Based on debugging timers, my method is the quickest way to count characters in a string. He asked nothing about I/O or what he plans to do with the data. – Icemanind Jun 14 '17 at 19:29
  • "we have lots of files like this and each of them are about 300-400MB". Try pinning a few 400MB blocks with your solution. – H H Jun 15 '17 at 06:39