2

I have a large file (>200 MB). The file is a CSV file from an external party, but sadly I cannot just read it line by line, because only the sequence \r\n is used to define a new line.

Currently I am reading in all the lines using this approach:

var file = File.ReadAllText(filePath, Encoding.Default);
var lines = Regex.Split(file, @"\r\n");

for (int i = 0; i < lines.Length; i++)
{
    string line = lines[i];
    ...
}

How can I optimize this? After calling ReadAllText on my 225MB file, the process is using more than 1GB RAM. Is it possible to use a streaming approach in my case, where I need to split the file using my \r\n pattern?

EDIT1: Your solutions using File.ReadLines and a StreamReader will not work, because they treat every line break in the file as the end of a line. I need to split the file using my \r\n pattern. Reading the file with my code results in 758,371 lines (which is correct), whereas a normal line count results in more than 1.5 million.
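To make the difference in counts concrete, here is a minimal sketch. The sample string and the names ReadLineDemo, CountWithReadLine and CountWithCrLfOnly are made up for illustration and are not taken from the real file. It relies on the documented behaviour of TextReader.ReadLine, which ends a line at a lone \n, a lone \r, or \r\n, whereas splitting on the two-character sequence alone keeps a bare \n inside the record.

```csharp
using System;
using System.IO;

public static class ReadLineDemo
{
    // Counts lines the way TextReader.ReadLine sees them:
    // "\n", "\r" and "\r\n" all terminate a line.
    public static int CountWithReadLine(string text)
    {
        int count = 0;
        using (var reader = new StringReader(text))
        {
            while (reader.ReadLine() != null)
                count++;
        }
        return count;
    }

    // Counts records when only the sequence CR LF ends a record.
    public static int CountWithCrLfOnly(string text)
    {
        return text.Split(new[] { "\r\n" }, StringSplitOptions.None).Length;
    }

    public static void Main()
    {
        // Hypothetical sample: two records, the first one contains a bare LF.
        string sample = "a;first\nhalf\r\nb;second";

        Console.WriteLine(CountWithReadLine(sample)); // prints 3
        Console.WriteLine(CountWithCrLfOnly(sample)); // prints 2
    }
}
```

If the real file behaves like this sample, any ReadLine-based approach will report more lines than there are records.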

SOLUTION

public static IEnumerable<string> ReadLines(string path)
{
    const string delim = "\r\n";

    using (StreamReader sr = new StreamReader(path))
    {
        StringBuilder sb = new StringBuilder();
        int next;

        // Read() returns -1 at the end of the stream, so check before casting to char.
        while ((next = sr.Read()) != -1)
        {
            char c = (char)next;
            sb.Append(c);

            // A line is complete once the buffer ends with the whole delimiter.
            // Comparing the buffer tail also handles input such as "\r\r\n".
            if (c == delim[delim.Length - 1] &&
                sb.Length >= delim.Length &&
                sb.ToString(sb.Length - delim.Length, delim.Length) == delim)
            {
                sb.Length -= delim.Length;
                yield return sb.ToString();
                sb.Clear();
            }
        }

        if (sb.Length > 0)
            yield return sb.ToString();
    }
}
dhrm
  • As many have pointed out before, `\r\n` is the default newline for Windows environments. Are you on something other than Windows? – flindeberg Oct 26 '12 at 12:28

5 Answers

6

You can use File.ReadLines, which returns an IEnumerable<string> and reads lazily instead of loading the whole file into memory.

foreach (var line in File.ReadLines(filePath, Encoding.Default)
                         .Where(l => !String.IsNullOrEmpty(l)))
{
    // process line
}
L.B
  • @DennisMadsen you can skip the empty lines. – L.B Oct 26 '12 at 11:43
  • It is not about empty lines. The point is that a line in the input format does not end at every line break in the file, but only when a '\r\n' is seen. – dhrm Oct 26 '12 at 12:00
  • @DennisMadsen I cannot understand you. `\r\n` actually means *new line*. Can you post a few of your lines to some location like pastebin? Try this (**`var chars = Environment.NewLine.ToCharArray();`**) – L.B Oct 26 '12 at 12:05
  • @L.B His environment is probably not Windows, and the only way to reset Environment.NewLine is using reflection as far as I know. Do you know a better way? – flindeberg Oct 26 '12 at 12:29
  • @flindeberg No need to reset NewLine. File.ReadLines will read it correctly even if the NewLine char would be only `\r` – L.B Oct 26 '12 at 12:34
4

Using a StreamReader, it is easy.

using (StreamReader sr = new StreamReader(path))
{
    foreach (string line in GetLine(sr))
    {
        //
    }
}


    IEnumerable<string> GetLine(StreamReader sr)
    {
        while (!sr.EndOfStream)
            yield return new string(GetLineChars(sr).ToArray());
    }

    IEnumerable<char> GetLineChars(StreamReader sr)
    {
        if (sr.EndOfStream)
            yield break;
        var c1 = sr.Read();
        // Note: '\\' is a literal backslash, so this matches the four characters \ r \ n as text.
        if (c1 == '\\')
        {
            var c2 = sr.Read();
            if (c2 == 'r')
            {
                var c3 = sr.Read();
                if (c3 == '\\')
                {
                    var c4 = sr.Read();
                    if (c4 == 'n')
                    {
                        yield break;
                    }
                    else
                    {
                        yield return (char)c1;
                        yield return (char)c2;
                        yield return (char)c3;
                        yield return (char)c4;
                    }
                }
                else
                {
                    yield return (char)c1;
                    yield return (char)c2;
                    yield return (char)c3;
                }
            }
            else
            {
                yield return (char)c1;
                yield return (char)c2;
            }
        }
        else
            yield return (char)c1;
    }
Rohit
  • @DennisMadsen I have tried to answer your question, but your requirement is weird :) – Rohit Oct 26 '12 at 12:44
  • +1 for answering the question. Not sure why it wasn't accepted. Maybe because it's less general-purpose? (I came here looking for an efficient way to apply regex find/replace to large text files, so this doesn't seem to help me more than ReadAllText would.) – Jon Coombs Apr 15 '14 at 18:42
0

Use a StreamReader to read the file line by line:

using (StreamReader sr = new StreamReader(filePath))
{
  while (true)
  {
    string line = sr.ReadLine();
    if (line == null)
      break;
  }
}
tozka
0

How about

        // The using block disposes the reader (and the underlying file handle) when done.
        using (StreamReader sr = new StreamReader(path))
        {
            while (!sr.EndOfStream)
            {
                string line = sr.ReadLine();
            }
        }

Using the stream reader approach means the whole file won't get loaded into memory.

Justin Harvey
  • I see, then have a look at this thread, I would try diEmAll's approach... http://stackoverflow.com/questions/9873097/c-sharp-streamreader-readline-for-custom-delimiters – Justin Harvey Oct 26 '12 at 12:06
0

This was my lunch break :)

Set MAXREAD to the amount of data you want to read into memory at a time. Because the method uses yield return, rows are produced lazily, for example when consumed in a foreach. Use the code at your own risk; I have only tried it on smaller sets of data :)

Your usage would be something like:

foreach (var row in new StreamReader(FileName).SplitByChar(new char[] {'\r','\n'}))
{
  // Do something awesome! :)
}    

And the extension method like this:

public static class FileStreamExtensions
{
    public static IEnumerable<string> SplitByChar(this StreamReader stream, char[] splitter)
    {
        const int MAXREAD = 1024 * 1024;   // characters read per chunk

        var buffer = new char[MAXREAD];
        var current = new StringBuilder(); // row collected so far
        int read;

        while ((read = stream.Read(buffer, 0, MAXREAD)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                current.Append(buffer[i]);

                // Test the tail of the row instead of looking ahead in the buffer,
                // so a splitter that straddles two chunks is still found and the
                // buffer is never indexed past the characters actually read.
                if (EndsWith(current, splitter))
                {
                    current.Length -= splitter.Length;
                    yield return current.ToString();
                    current.Clear();
                }
            }
        }

        // Whatever follows the last splitter is the final row.
        if (current.Length > 0)
            yield return current.ToString();
    }

    private static bool EndsWith(StringBuilder sb, char[] suffix)
    {
        if (sb.Length < suffix.Length)
            return false;

        for (int i = 0; i < suffix.Length; i++)
        {
            if (sb[sb.Length - suffix.Length + i] != suffix[i])
                return false;
        }
        return true;
    }
}
flindeberg