24

I am reading each line of a CSV file and need to get the individual values in each column. So right now I am just using:

values = line.Split(delimiter);

where line is a string that holds the values that are separated by the delimiter.

Measuring the performance of my ReadNextRow method, I noticed that it spends 66% of its time in String.Split, so I was wondering if someone knows of a faster method to do this.

Thanks!

  • I know the exact contents of the CSV files, so I don't have to worry about escaping characters, etc. I used dotTrace by JetBrains for profiling. I actually use the Code Project CsvReader in other parts of my code. Performance is important in this piece of code, which was the reason I asked. –  Feb 20 '09 at 12:29
  • Thanks for all the replies. Sorry my comment did not come out right, since this comment field seems to ignore new lines. –  Feb 20 '09 at 12:30

14 Answers

25

The BCL implementation of string.Split is actually quite fast. I've done some testing here trying to outperform it, and it's not easy.

But there's one thing you can do and that's to implement this as a generator:

public static IEnumerable<string> GetSplit( this string s, char c )
{
    int l = s.Length;
    int i = 0, j = s.IndexOf( c, 0, l );
    if ( j == -1 ) // No such substring
    {
        yield return s; // Return original and break
        yield break;
    }

    while ( j != -1 )
    {
        if ( j - i > 0 ) // Non empty? 
        {
            yield return s.Substring( i, j - i ); // Return non-empty match
        }
        i = j + 1;
        j = s.IndexOf( c, i, l - i );
    }

    if ( i < l ) // Has remainder?
    {
        yield return s.Substring( i, l - i ); // Return remaining trail
    }
}

The above method is not necessarily faster than string.Split for small strings, but it returns results as it finds them; this is the power of lazy evaluation. If you have long lines or need to conserve memory, this is the way to go.
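For illustration, a quick usage sketch of the method above (the sample line and its contents are made up, and this assumes GetSplit is declared in a static class so the extension method resolves):

using System;
using System.Linq;

// Because enumeration is deferred, only the fields we actually consume are
// ever turned into substrings; Take(2) stops the generator after two fields.
string line = "2009-02-20;EURUSD;1.2775;1000000;buy";
foreach (string field in line.GetSplit(';').Take(2))
{
    Console.WriteLine(field);
}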

The above method is bounded by the performance of IndexOf and Substring, which do too much index-out-of-range checking; to be faster you need to optimize those away and implement your own helper methods. You can beat string.Split's performance, but it's gonna take clever int-hacking. You can read my post about that here.

John Leidegren
  • Apparently, there is no need to save memory, but there is a need to save CPU. – Dave Van den Eynde Feb 20 '09 at 10:10
  • 1
    @Dave Van den Eynde - I think it's important to do both! But yeah, memory optimizations are greatly overlooked by most programmers. – John Leidegren Feb 20 '09 at 10:18
  • 3
    I did an approach similar to this, and it was slower than the existing algorithm that used Split, but because we were processing such large strings (multiple Megabytes), it saved about 30% on the RAM consumption. – torial Apr 18 '09 at 03:53
  • 1
    You know, that code isn't optimized, and the reason why string.Split is faster is because it uses unsafe code. If you include that here, the running time is the same. Except this is a lot more memory efficient. – John Leidegren Apr 19 '09 at 07:22
  • I know this is old but I thought I would point out that this solution appears to be removing empty items from the returned collection. Calling "1,,3".GetSplit(',') returns a collection containing only 2 items: a 1 and a 3. This is different behavior than .NET's Split method. – Gary Brunton Feb 06 '14 at 05:18
  • Yeah, that's by design. If you don't want that behavior by default you'd have to adjust the code. – John Leidegren Feb 06 '14 at 18:09
18

It should be pointed out that split() is a questionable approach for parsing CSV files in case you come across commas inside field values, e.g.:

1,"Something, with a comma",2,3

The other thing I'll point out, without knowing how you profiled this, is to be careful about profiling this kind of low-level detail. The granularity of the Windows/PC timer might come into play, and you may have significant overhead in just looping, so use some sort of control value.

That being said, split() is built to handle regular expressions, which are obviously more complex than you need (and the wrong tool to deal with escaped commas anyway). Also, split() creates lots of temporary objects.

So if you want to speed it up (and I have trouble believing that performance of this part is really an issue) then you want to do it by hand and you want to reuse your buffer objects so you're not constantly creating objects and giving the garbage collector work to do in cleaning them up.

The algorithm for that is relatively simple (a rough sketch follows the list):

  • Stop at every comma;
  • When you hit quotes continue until you hit the next set of quotes;
  • Handle escaped quotes (i.e. \") and arguably escaped commas (\,).
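A minimal sketch of that hand-parsing approach (my own illustration, not the answerer's code): it stops at separators, keeps separators that appear inside double-quoted fields, and treats "" inside quotes as an escaped quote (the Excel convention mentioned in the comments) rather than \". A real implementation would also reuse its buffers across lines, as suggested above.

using System.Collections.Generic;
using System.Text;

public static class CsvLine
{
    // Illustrative hand parser: walks the characters once and accumulates the
    // current field in a StringBuilder instead of calling Split.
    public static List<string> Parse(string line, char separator = ',')
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char ch = line[i];

            if (inQuotes)
            {
                if (ch == '"')
                {
                    if (i + 1 < line.Length && line[i + 1] == '"')
                    {
                        current.Append('"');  // "" inside quotes -> literal quote
                        i++;
                    }
                    else
                    {
                        inQuotes = false;     // closing quote
                    }
                }
                else
                {
                    current.Append(ch);       // separators inside quotes are kept
                }
            }
            else if (ch == '"')
            {
                inQuotes = true;              // opening quote
            }
            else if (ch == separator)
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(ch);
            }
        }

        fields.Add(current.ToString());       // last field has no trailing separator
        return fields;
    }
}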

Oh and to give you some idea of the cost of regex, there was a question (Java not C# but the principle was the same) where someone wanted to replace every n-th character with a string. I suggested using replaceAll() on String. Jon Skeet manually coded the loop. Out of curiosity I compared the two versions and his was an order of magnitude better.

So if you really want performance, it's time to hand parse.

Or, better yet, use someone else's optimized solution like this fast CSV reader.

By the way, while this is in relation to Java it concerns the performance of regular expressions in general (which is universal) and replaceAll() vs a hand-coded loop: Putting char into a java string for each N characters.

cletus
  • I've linked an answer on a similar topic about string replace methods, you'll find the link at the end of my own answer to this question. – John Leidegren Feb 20 '09 at 10:17
  • I just wanted to say thanks. You reaffirmed what I thought, and forced me to go through my code again and look through where I was being inefficient. Turns out I had a conditional statement in the wrong order, and I think I would have just called it a day without seeing your post. – kemiller2002 Mar 03 '11 at 22:21
  • In excel generated csv, escaped quotes are "", not \" – Rado Jul 18 '16 at 02:02
  • What about nowadays and Span? – Akmal Salikhov Jun 29 '19 at 10:45
6

Here's a very basic example using ReadOnlySpan. On my machine this takes around 150ns as opposed to string.Split() which takes around 250ns. That's a nice 40% improvement right there.

string serialized = "1577836800;1000;1";
ReadOnlySpan<char> span = serialized.AsSpan();

Trade result = new Trade(); // Trade: the caller's type with UnixTimestamp, Price and Quantity

// Cut off one field at a time by slicing the span; the fields are parsed
// directly from the span, so no intermediate strings are allocated.
int index = span.IndexOf(';');
result.UnixTimestamp = long.Parse(span.Slice(0, index));
span = span.Slice(index + 1);

index = span.IndexOf(';');
result.Price = float.Parse(span.Slice(0, index));
span = span.Slice(index + 1);

// The last field has no trailing separator, so parse the remaining span as-is.
result.Quantity = float.Parse(span);

return result;

Note that a ReadOnlySpan.Split() will soon be part of the framework. See https://github.com/dotnet/runtime/pull/295

4

Depending on use, you can speed this up by using Pattern.split instead of String.split. If you have this code in a loop (which I assume you probably do since it sounds like you are parsing lines from a file) String.split(String regex) will call Pattern.compile on your regex string every time that statement of the loop executes. To optimize this, Pattern.compile the pattern once outside the loop and then use Pattern.split, passing the line you want to split, inside the loop.

Hope this helps
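For reference, the rough C# equivalent of this advice (my sketch, not part of the answer, and only relevant if you genuinely need a regex delimiter) is to construct one Regex, optionally with RegexOptions.Compiled, outside the loop instead of calling the static Regex.Split(line, pattern) on every iteration:

using System.Text.RegularExpressions;

public static class RegexLineSplitter
{
    // Built once and reused for every line; the pattern is just an example.
    private static readonly Regex Delimiter = new Regex(",", RegexOptions.Compiled);

    public static string[] Split(string line)
    {
        return Delimiter.Split(line);
    }
}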

3

I found this implementation, which is 30% faster, on Dejan Pelzel's blog. I quote from there:

The Solution

With this in mind, I set to create a string splitter that would use an internal buffer similarly to a StringBuilder. It uses very simple logic of going through the string and saving the value parts into the buffer as it goes along.

public int Split(string value, char separator)
{
    int resultIndex = 0;
    int startIndex = 0;

    // Find the mid-parts
    for (int i = 0; i < value.Length; i++)
    {
        if (value[i] == separator)
        {
            this.buffer[resultIndex] = value.Substring(startIndex, i - startIndex);
            resultIndex++;
            startIndex = i + 1;
        }
    }

    // Find the last part
    this.buffer[resultIndex] = value.Substring(startIndex, value.Length - startIndex);
    resultIndex++;

    return resultIndex;
}

How To Use

The StringSplitter class is incredibly simple to use, as you can see in the example below. Just be careful to reuse the StringSplitter object and not create a new instance of it in loops or for single-time use. In that case it would be better to just use the built-in String.Split.

var splitter = new StringSplitter(2);
splitter.Split("Hello World", ' ');
if (splitter.Results[0] == "Hello" && splitter.Results[1] == "World")
{
    Console.WriteLine("It works!");
}

The Split method returns the number of items found, so you can easily iterate through the results like this:

var splitter = new StringSplitter(2);
var len = splitter.Split("Hello World", ' ');
for (int i = 0; i < len; i++)
{
    Console.WriteLine(splitter.Results[i]);
}

This approach has advantages and disadvantages.
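For context, here is a rough reconstruction of the surrounding class implied by the quoted method and the usage examples; the constructor argument, the Results property and the fixed-size buffer are my assumptions rather than a copy of Dejan Pelzel's original:

public class StringSplitter
{
    // Reused between calls; only the first N entries are valid, where N is
    // the value returned by Split.
    private readonly string[] buffer;

    public StringSplitter(int capacity)
    {
        this.buffer = new string[capacity]; // capacity = maximum number of parts
    }

    public string[] Results
    {
        get { return this.buffer; }
    }

    public int Split(string value, char separator)
    {
        int resultIndex = 0;
        int startIndex = 0;

        for (int i = 0; i < value.Length; i++)
        {
            if (value[i] == separator)
            {
                this.buffer[resultIndex++] = value.Substring(startIndex, i - startIndex);
                startIndex = i + 1;
            }
        }

        this.buffer[resultIndex++] = value.Substring(startIndex);
        return resultIndex;
    }
}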

orellabac
1
    // Note: this works by writing into the input string's own character buffer
    // (compacting the current token to the front and null-terminating it), so it
    // mutates the supposedly immutable string passed in. new String(char*) stops
    // at the first null character, which is how each token is extracted.
    public static unsafe List<string> SplitString(char separator, string input)
    {
        List<string> result = new List<string>();
        int i = 0; // write position of the current token within the buffer
        fixed (char* buffer = input)
        {
            for (int j = 0; j < input.Length; j++)
            {
                if (buffer[j] == separator)
                {
                    buffer[i] = (char)0;              // terminate the token
                    result.Add(new String(buffer));   // copy it out
                    i = 0;                            // start the next token at the front
                }
                else
                {
                    buffer[i] = buffer[j];            // shift the character down
                    i++;
                }
            }
            buffer[i] = (char)0;
            result.Add(new String(buffer));           // last token
        }
        return result;
    }
Basil
1

You might think that there are optimizations to be had, but the reality is that you'll pay for them elsewhere.

You could, for example, do the split 'yourself' and walk through all the characters and process each column as you encounter it, but you'd be copying all the parts of the string in the long run anyhow.

One of the optimizations we could do in C or C++, for example, is replace all the delimiters with '\0' characters, and keep pointers to the start of the column. Then, we wouldn't have to copy all of the string data just to get to a part of it. But this you can't do in C#, nor would you want to.

If there is a big difference between the number of columns that are in the source, and the number of columns that you need, walking the string manually may yield some benefit. But that benefit would cost you the time to develop it and maintain it.

I've been told that 90% of the CPU time is spent in 10% of the code. There are variations to this "truth". In my opinion, spending 66% of your time in Split is not that bad if processing CSV is the thing that your app needs to do.

Dave

Dave Van den Eynde
0

This is my solution:

Public Shared Function FastSplit(inputString As String, separator As String) As String()
        Dim kwds(1) As String
        Dim k = 0
        Dim tmp As String = ""

        For l = 1 To inputString.Length ' Mid() is 1-based, so include the last character
            tmp = Mid(inputString, l, 1)
            If tmp = separator Then k += 1 : tmp = "" : ReDim Preserve kwds(k + 1)
            kwds(k) &= tmp
        Next

        Return kwds
End Function

Here is a version with benchmarking:

Public Shared Function FastSplit(inputString As String, separator As String) As String()
        Dim sw As New Stopwatch
        sw.Start()
        Dim kwds(1) As String
        Dim k = 0
        Dim tmp As String = ""

        For l = 1 To inputString.Length ' Mid() is 1-based, so include the last character
            tmp = Mid(inputString, l, 1)
            If tmp = separator Then k += 1 : tmp = "" : ReDim Preserve kwds(k + 1)
            kwds(k) &= tmp
        Next
        sw.Stop()
        Dim fsTime As Long = sw.ElapsedTicks

        sw.Restart() ' reset before timing Split
        Dim strings() As String = inputString.Split(separator)
        sw.Stop()

        Debug.Print("FastSplit took " + fsTime.ToString + " whereas split took " + sw.ElapsedTicks.ToString)

        Return kwds
End Function

Here are some results on relatively small strings but with varying sizes, up to 8kb blocks. (times are in ticks)

FastSplit took 8 whereas split took 10
FastSplit took 214 whereas split took 216
FastSplit took 10 whereas split took 12
FastSplit took 8 whereas split took 9
FastSplit took 8 whereas split took 10
FastSplit took 10 whereas split took 12
FastSplit took 7 whereas split took 9
FastSplit took 6 whereas split took 8
FastSplit took 5 whereas split took 7
FastSplit took 10 whereas split took 13
FastSplit took 9 whereas split took 232
FastSplit took 7 whereas split took 8
FastSplit took 8 whereas split took 9
FastSplit took 8 whereas split took 10
FastSplit took 215 whereas split took 217
FastSplit took 10 whereas split took 231
FastSplit took 8 whereas split took 10
FastSplit took 8 whereas split took 10
FastSplit took 7 whereas split took 9
FastSplit took 8 whereas split took 10
FastSplit took 10 whereas split took 1405
FastSplit took 9 whereas split took 11
FastSplit took 8 whereas split took 10

Also, I know someone will discourage my use of ReDim Preserve instead of using a list... The reason is, the list really didn't provide any speed difference in my benchmarks so I went back to the "simple" way.

0

As others have said, String.Split() will not always work well with CSV files. Consider a file that looks like this:

"First Name","Last Name","Address","Town","Postcode"
David,O'Leary,"12 Acacia Avenue",London,NW5 3DF
June,Robinson,"14, Abbey Court","Putney",SW6 4FG
Greg,Hampton,"",,
Stephen,James,"""Dunroamin"" 45 Bridge Street",Bristol,BS2 6TG

(e.g. inconsistent use of speechmarks, strings including commas and speechmarks, etc)

This CSV reading framework will deal with all of that, and is also very efficient:

LumenWorks.Framework.IO.Csv by Sebastien Lorion
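For reference, typical usage looks roughly like the sketch below. This is written from memory of the library's CodeProject sample, so treat the exact member names (the CsvReader constructor, ReadNextRecord, GetFieldHeaders, FieldCount) as assumptions to check against the version you use:

using System;
using System.IO;
using LumenWorks.Framework.IO.Csv;

class Program
{
    static void Main()
    {
        // second argument: true = the first record is a header row
        using (var csv = new CsvReader(new StreamReader("data.csv"), true))
        {
            int fieldCount = csv.FieldCount;
            string[] headers = csv.GetFieldHeaders();

            while (csv.ReadNextRecord())
            {
                for (int i = 0; i < fieldCount; i++)
                    Console.Write("{0}={1} ", headers[i], csv[i]);
                Console.WriteLine();
            }
        }
    }
}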

codeulike
0

Some very thorough analysis of String.Split() vs Regex and other methods.

We are talking ms savings over very large strings though.

Charlie
  • Normally I like .Net Perls, but I think their comparison is unfair. If you know you are going to use a Regex a lot, you compile it and extract it from the loop. You'll get some big reductions on the overall time using that strategy. – torial Apr 18 '09 at 03:58
  • The article is deleted; this is an archived version of the article on dotnetperls.com: http://web.archive.org/web/20090316210342/http://dotnetperls.com/Content/String-Split-Benchmark.aspx – Stanislav Prusac Jan 26 '17 at 10:37
  • It's back on dotnetperls: https://www.dotnetperls.com/split My findings: 10000000 Regex.split's are 10% slower than 10000000 string.Split's (.net framework 4) – OzBob Feb 16 '17 at 09:20
0

The main problem(?) with String.Split is that it's general, in that it caters for many needs.

If you know more about your data than Split does, it can be an improvement to write your own.

For instance, if:

  1. You don't care about empty strings, so you don't need to handle those in any special way
  2. You don't need to trim strings, so you don't need to do anything with or around those
  3. You don't need to check for quoted commas or quotes
  4. You don't need to handle quotes at all

If any of these are true, you might see an improvement by writing your own more specific version of String.Split.
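For illustration, a rough sketch of such a specialised version under assumptions like those above: no trimming, no quote handling, and empty fields simply come back as empty strings. Two passes let it allocate an exactly-sized array up front instead of growing a list:

public static class SpecialisedSplit
{
    public static string[] Split(string s, char separator)
    {
        // First pass: count the separators so the result array is sized exactly.
        int count = 1;
        for (int i = 0; i < s.Length; i++)
        {
            if (s[i] == separator) count++;
        }

        // Second pass: cut the fields out.
        var result = new string[count];
        int start = 0, n = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (s[i] == separator)
            {
                result[n++] = s.Substring(start, i - start);
                start = i + 1;
            }
        }
        result[n] = s.Substring(start);

        return result;
    }
}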

Having said that, the first question you should ask is whether this actually is a problem worth solving. Is the time taken to read and import the file so long that you actually feel this is a good use of your time? If not, then I would leave it alone.

The second question is why String.Split is using that much time compared to the rest of your code. If the answer is that the code is doing very little with the data, then I would probably not bother.

However, if, say, you're stuffing the data into a database, then 66% of the time of your code spent in String.Split constitutes a big big problem.

Lasse V. Karlsen
0

CSV parsing is actually fiendishly complex to get right; I used classes based on wrapping the ODBC Text driver the one and only time I had to do this.

The ODBC solution recommended above looks at first glance to be basically the same approach.

I thoroughly recommend you do some research on CSV parsing before you get too far down a path that nearly-but-not-quite works (all too common). The Excel thing of only double-quoting strings that need it is one of the trickiest to deal with in my experience.

kpollock
-1

You can assume that String.Split will be close to optimal; i.e. it could be quite hard to improve on it. By far the easier solution is to check whether you need to split the string at all. It's quite likely that you'll be using the individual strings directly. If you define a StringShim class (reference to String, begin & end index) you'll be able to split a String into a set of shims instead. These will have a small, fixed size, and will not cause string data copies.
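A hypothetical sketch of such a shim, using a start index and length in place of begin/end indices; the names and members are mine, and on modern .NET ReadOnlyMemory<char> plays essentially this role:

public struct StringShim
{
    private readonly string source;
    public readonly int Start;
    public readonly int Length;

    public StringShim(string source, int start, int length)
    {
        this.source = source;
        this.Start = start;
        this.Length = length;
    }

    public char this[int i]
    {
        get { return source[Start + i]; }
    }

    // String data is only copied if you explicitly ask for a real string.
    public override string ToString()
    {
        return source.Substring(Start, Length);
    }
}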

MSalters
  • It will cause string data copies once you need to pass a StringShim to something that accepts a string. Unless your whole app works with shims instead. – Dave Van den Eynde Feb 20 '09 at 10:11
  • You can't assume that at all. I'll dig up the example using regex vs hand-coding where the regex solution was an order of magnitude slower. – cletus Feb 20 '09 at 10:12
  • Here it is http://stackoverflow.com/questions/537174/putting-char-into-a-java-string-for-each-n-characters – cletus Feb 20 '09 at 10:14
  • My point is that it's hard to be faster *with the same interface*. My StringShim solution is quite explicit changing the split() interface to make things faster. – MSalters Feb 20 '09 at 13:31
  • Almost every .NET function is designed for multiple-case scenarios, thus if you can be certain of the data, you can build a tailored function that will always perform better than the default .NET implementation. I downvoted your answer because reinventing the wheel is not always a bad thing, despite what the internet would like to see you regurgitate. – Krythic Sep 21 '16 at 19:19
-2

String.split is rather slow; if you want some faster methods, here you go. :)

However, CSV is much better parsed by a rule-based parser.

This guy has made a rule-based tokenizer for Java (it requires some copying and pasting, unfortunately):

http://www.csdgn.org/code/rule-tokenizer

private static final String[] fSplit(String src, char delim) {
    ArrayList<String> output = new ArrayList<String>();
    int index = 0;
    int lindex = 0;
    while((index = src.indexOf(delim,lindex)) != -1) {
        output.add(src.substring(lindex,index));
        lindex = index+1;
    }
    output.add(src.substring(lindex));
    return output.toArray(new String[output.size()]);
}

private static final String[] fSplit(String src, String delim) {
    ArrayList<String> output = new ArrayList<String>();
    int index = 0;
    int lindex = 0;
    while((index = src.indexOf(delim,lindex)) != -1) {
        output.add(src.substring(lindex,index));
        lindex = index+delim.length();
    }
    output.add(src.substring(lindex));
    return output.toArray(new String[output.size()]);
}
chase