How does StreamReader read all chars, including 0x0D 0x0A chars?

I have an old .txt file I am trying to convert. Many lines (but not all) end with "0x0D 0x0D 0x0A".

This code reads all of the lines.

StreamReader srFile = new StreamReader(gstPathFileName);
while (!srFile.EndOfStream) {
    string stFileContents = srFile.ReadLine();
    ...
}

This produces an extra "" string after each .txt line. Because the file also has genuine blank lines between paragraphs, removing every "" string would remove those blank lines too.

Is there a way to have StreamReader read all of the chars including the "0x0D 0x0D 0x0A"?


Edited two hours later ... the file is huge, 1.6MB.

ttom
  • I think reimplementing the ReadLine() is the best idea. If the file is very small you could read it all and then `string.Split` it by 0x0d 0x0a and trim the optional 0x0d at the end of each line – xanatos Feb 28 '15 at 17:52
  • End-of-line detecting in StreamReader is hard-coded, you can't tinker with it. Fixing the file with a text editor is surely the most pragmatic solution. – Hans Passant Feb 28 '15 at 18:03
  • 0x0D as text or as byte? – Gabe Feb 28 '15 at 18:26
  • StreamReader already reads these sequences of chars. 0x0D (`\r`) and 0x0D 0x0A (`\r\n`) are different forms of line break, and StreamReader handles both. So when it reads a `\r` with no `\n` after it, it interprets this as a line break and returns a result from ReadLine. The next time you call ReadLine it sees `\r\n` and returns an empty string, because there are no other symbols between the previous `\r` and the current `\r\n`. So if you want to translate 0x0D 0x0D 0x0A to a single line break then fix the file as @Hans Passant says. – Yoh Deadfall Feb 28 '15 at 18:27
  • I've got an implementation I think will work for your case below. It reads the file as bytes then interprets them, returning a new line when it encounters `0x0d 0x0d 0x0a`. n.b.1 - 1.6MB is far from huge (e.g. [this SO question](http://stackoverflow.com/q/846475/1364007)). n.b.2 - why does your file even have `0d0d0a` as line endings? – Wai Ha Lee Feb 28 '15 at 20:29
  • The file has 0D0D0A as endings as it is extracted from a database via code that was written more than 20 years ago. – ttom Feb 28 '15 at 21:13
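
The behaviour described in the comments can be reproduced in a few lines. The following is a minimal sketch, not code from the thread: it uses a `StringReader` in place of the file, and the sample text and the names `ReadLineDemo` and `SplitWithReadLine` are made up for illustration. `StreamReader.ReadLine` applies the same line-terminator rules.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ReadLineDemo
{
    // Hypothetical helper: collects every string that ReadLine() returns for the text.
    public static List<string> SplitWithReadLine(string text)
    {
        var lines = new List<string>();
        using (var reader = new StringReader(text))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                lines.Add(line);
            }
        }
        return lines;
    }

    public static void Main()
    {
        // Two text lines, each terminated by 0x0D 0x0D 0x0A.
        foreach (string line in SplitWithReadLine("first\r\r\nsecond\r\r\n"))
        {
            // The first \r ends the text line; the following \r\n ends an empty line.
            Console.WriteLine("[" + line + "]");
        }
        // Prints [first], [], [second], [] (one per line)
    }
}
```

Each "0x0D 0x0D 0x0A" therefore yields the text line plus one empty string, which is why the empty strings cannot be told apart from the real blank lines between paragraphs.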

4 Answers

A very simple reimplementation of ReadLine. I wrote a version that returns an IEnumerable<string> because it's easier, and made it an extension method, hence the static class. The code is heavily commented, so it should be easy to read.

using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class StreamEx
{
    public static string[] ReadAllLines(this TextReader tr, string separator)
    {
        return tr.ReadLines(separator).ToArray();
    }

    // StreamReader is based on TextReader
    public static IEnumerable<string> ReadLines(this TextReader tr, string separator)
    {
        // Handling of empty file: old remains null
        string old = null;

        // Read buffer
        var buffer = new char[128];

        while (true)
        {
            // If we already read something
            if (old != null)
            {
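                // Look for the separator (ordinal comparison, so "\r\n" is
                // matched char by char regardless of the current culture)
                int ix = old.IndexOf(separator, System.StringComparison.Ordinal);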

                // If found
                if (ix != -1)
                {
                    // Return the piece of line before the separator
                    yield return old.Remove(ix);

                    // Then remove the piece of line before the separator plus the separator
                    old = old.Substring(ix + separator.Length);

                    // And continue 
                    continue;
                }
            }

            // old doesn't contain any separator, let's read some more chars
            int read = tr.ReadBlock(buffer, 0, buffer.Length);

            // If there is no more chars to read, break the cycle
            if (read == 0)
            {
                break;
            }

            // Add the just read chars to the old chars
            // note that null + "somestring" == "somestring"
            old += new string(buffer, 0, read);

            // A new "round" of the while cycle will search for the separator
        }

        // Now we have to handle chars after the last separator

        // If we read something
        if (old != null)
        {
            // Return all the remaining characters
            yield return old;
        }
    }
}

Note that, as written, it won't directly handle your problem :-) But it lets you select the separator you want to use. So you use "\r\n" and then you trim the excess '\r'.

Use it like this:

using (var sr = new StreamReader("somefile"))
{
    // Little LINQ to strip excess \r and to make an array
    // (note that by making an array you'll put all the file
    // in memory)
    string[] lines = sr.ReadLines("\r\n").Select(x => x.TrimEnd('\r')).ToArray();
}

or

using (var sr = new StreamReader("somefile"))
{
    // Little LINQ to strip excess \r
    // (note that the file will be read line by line, so only
    // a line at a time is in memory, plus some remaining characters
    // of the next line in the old buffer)
    IEnumerable<string> lines = sr.ReadLines("\r\n").Select(x => x.TrimEnd('\r'));

    foreach (string line in lines)
    {
        // Do something
    }
}
xanatos

You could always use a BinaryReader and manually read in lines a byte at a time. Keep hold of the bytes, then when you come across 0x0d 0x0d 0x0a, make a new string of the bytes for the current line.

Note:

  • I'm assuming that your encoding is Encoding.UTF8 but your case might be different. Accessing bytes directly, I don't know off-hand how to interpret the encoding.
  • If your file has extra information, e.g. a byte order mark, that will be returned too.

Here it is:

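// Requires: using System.Collections.Generic; using System.IO;
//           using System.Linq; using System.Text;
public static IEnumerable<string> ReadLinesFromStream(string fileName)
{
    // File.Open needs a FileMode argument; File.OpenRead opens the file read-only
    using ( var fileStream = File.OpenRead(fileName) )
    using ( BinaryReader binaryReader = new BinaryReader(fileStream) )
    {
        var bytes = new List<byte>();

        // Compare stream positions rather than calling PeekChar(), which decodes
        // characters and can throw on bytes that are invalid in the reader's encoding
        while ( fileStream.Position < fileStream.Length )
        {
            bytes.Add(binaryReader.ReadByte());

            // True when the last three bytes read are 0x0d 0x0d 0x0a
            bool newLine = bytes.Count > 2
                && bytes[bytes.Count - 3] == 0x0d
                && bytes[bytes.Count - 2] == 0x0d
                && bytes[bytes.Count - 1] == 0x0a;

            if ( newLine )
            {
                // Decode the line without its three terminator bytes
                yield return Encoding.UTF8.GetString(bytes.Take(bytes.Count - 3).ToArray());
                bytes.Clear();
            }
        }

        // Whatever follows the last terminator is returned as a final line
        if ( bytes.Count > 0 )
            yield return Encoding.UTF8.GetString(bytes.ToArray());
    }
}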
Wai Ha Lee

This code works well ... reads every char.

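int iReadLength = 100;
char[] acBuf = new char[iReadLength];
while (srFile.Peek() >= 0) {
    // Read returns the number of chars actually read; the last block
    // is usually shorter than the buffer
    int iRead = srFile.Read(acBuf, 0, iReadLength);
    // Build the string from the chars that were read only, so no
    // trailing '\0' chars end up in it
    string s = new string(acBuf, 0, iRead);
}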
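
Reading every char is only half of the job, since the chars still have to be turned back into lines. One possible way to do that, as a rough sketch rather than a tested solution: read char by char, end a line at each `\n`, and drop any `\r` chars directly before it. That handles both "0x0D 0x0D 0x0A" and plain "0x0D 0x0A" endings and keeps real blank lines. The names `CrTolerantReader`, `ReadLines` and `TrimTrailingCr` are made up for this illustration; a `\r` that is not followed by `\n` is assumed to be ordinary text and stays in the line.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class CrTolerantReader
{
    // Hypothetical helper: splits on '\n' and strips the '\r' chars before it.
    public static IEnumerable<string> ReadLines(TextReader reader)
    {
        var current = new StringBuilder();
        int c;
        while ((c = reader.Read()) != -1)
        {
            if (c != '\n')
            {
                // Ordinary char (or a '\r' that may belong to a terminator): keep it for now
                current.Append((char)c);
                continue;
            }

            // '\n' ends the line; return it without its trailing '\r' chars
            yield return TrimTrailingCr(current);
            current.Clear();
        }

        // Text after the last '\n', if any
        if (current.Length > 0)
        {
            yield return TrimTrailingCr(current);
        }
    }

    private static string TrimTrailingCr(StringBuilder sb)
    {
        int end = sb.Length;
        while (end > 0 && sb[end - 1] == '\r')
        {
            end--;
        }
        return sb.ToString(0, end);
    }
}
```

With the reader from the question this would be used as `foreach (string line in CrTolerantReader.ReadLines(srFile)) { ... }`.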
ttom

A very easy solution (not optimized for memory consumption) could be:

var allLines = File.ReadAllText(gstPathFileName)
    .Split('\n');

Then, if you need to remove trailing carriage return characters, do:

for(var i = 0; i < allLines.Length; ++i)
    allLines[i] = allLines[i].TrimEnd('\r');

You can put the relevant processing into that for loop if you want. Or, if you do not want to keep the array, use this instead of the for loop:

foreach(var line in allLines.Select(x => x.TrimEnd('\r')))
{
    // use 'line' here ...
}
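
To see what the split-and-trim approach does with "0x0D 0x0D 0x0A" endings, here is a small sketch that works on an in-memory string. The class and method names (`SplitDemo`, `SplitLines`) are hypothetical, and dropping the final empty entry is an extra step that is not part of the snippets above.

```csharp
using System;
using System.Linq;

public static class SplitDemo
{
    // Hypothetical helper: split on '\n', then strip every trailing '\r'.
    public static string[] SplitLines(string text)
    {
        string[] lines = text.Split('\n')
            .Select(x => x.TrimEnd('\r'))
            .ToArray();

        // Split('\n') leaves one empty entry after a final '\n'; drop it so
        // the result does not end with a line that is not in the file.
        if (lines.Length > 0 && text.EndsWith("\n", StringComparison.Ordinal))
        {
            Array.Resize(ref lines, lines.Length - 1);
        }

        return lines;
    }
}
```

For the text "para one\r\r\n\r\r\npara two\r\r\n" this gives three entries: "para one", "" and "para two", so the blank line between the paragraphs is kept.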
Jeppe Stig Nielsen