18

Ok, I am reading in dat files into a byte array. For some reason, the people who generate these files put about a half meg's worth of useless null bytes at the end of the file. Anybody know a quick way to trim these off the end?

First thought was to start at the end of the array and iterate backwards until I found something other than a null, then copy everything up to that point, but I wonder if there isn't a better way.

To answer some questions: Are you sure the 0 bytes are definitely in the file, rather than there being a bug in the file reading code? Yes, I am certain of that.

Can you definitely trim all trailing 0s? Yes.

Can there be any 0s in the rest of the file? Yes, there can be 0s in other places, so, no, I can't just start at the beginning and stop at the first 0.

Kevin
    the trailing nulls are probably from writing an entire buffer to file rather than just the used part of the buffer. I just had the same thing using MemoryStream.GetBuffer() rather than ToArray(). the former returns the entire buffer whereas the latter returns an array containing only the used part of the buffer. https://learn.microsoft.com/en-us/dotnet/api/system.io.memorystream.getbuffer?view=netframework-4.8 – more urgent jest Jan 30 '20 at 15:47
  • @moreurgentjest Interesting. A little late to be much help to me, but definitely a good point – Kevin Jan 30 '20 at 18:24
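
For anyone hitting the same cause, a minimal sketch of the GetBuffer()/ToArray() difference (buffer capacity is whatever the runtime allocates, not a guaranteed value):

```csharp
using System;
using System.IO;

class GetBufferDemo
{
    static void Main()
    {
        var ms = new MemoryStream();
        ms.Write(new byte[] { 1, 2, 3 }, 0, 3);

        // GetBuffer() exposes the whole internal buffer, including unused
        // capacity, so writing it to a file appends trailing zero bytes.
        byte[] raw = ms.GetBuffer();

        // ToArray() copies only the bytes actually written.
        byte[] used = ms.ToArray();

        Console.WriteLine($"GetBuffer: {raw.Length} bytes, ToArray: {used.Length} bytes");
    }
}
```

Writing the ToArray() result to disk avoids the padding at the source.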

11 Answers

25

I agree with Jon. The critical bit is that you must "touch" every byte from the last one until the first non-zero byte. Something like this:

byte[] foo;
// populate foo
int i = foo.Length - 1;
// i >= 0 guards against reading index -1 when the array is all zeros
while(i >= 0 && foo[i] == 0)
    --i;
// now foo[i] is the last non-zero byte
byte[] bar = new byte[i+1];
Array.Copy(foo, bar, i+1);

I'm pretty sure that's about as efficient as you're going to be able to make it.

Coderer
  • Only if you definitely have to copy the data :) One other option would be to treat it as an array of a wider type, e.g. int or long. That would probably require unsafe code, and you would have to deal with the end of the array separately if it had, say, an odd number of bytes (continued) – Jon Skeet Oct 27 '08 at 17:49
  • but it would probably be more efficient in the "finding" part. I *certainly* wouldn't start trying that until I'd proved it's the bottleneck though :) – Jon Skeet Oct 27 '08 at 17:49
  • You might want to add a minimum check in that `while`, or you're gonna end up trying to read index -1 if the array has nothing but 0 bytes. – Nyerguds Mar 15 '16 at 12:03
11

Given the extra questions now answered, it sounds like you're fundamentally doing the right thing. In particular, you have to touch every byte of the file from the last 0 onwards, to check that it only has 0s.

Now, whether you have to copy everything or not depends on what you're then doing with the data.

  • You could perhaps remember the index and keep it with the data or filename.
  • You could copy the data into a new byte array
  • If you want to "fix" the file, you could call FileStream.SetLength to truncate the file

The "you have to read every byte between the truncation point and the end of the file" is the critical part though.
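
If the goal is to repair the file on disk, here is a sketch of the SetLength route, assuming the file fits in memory (the sample data and filename are illustrative):

```csharp
using System;
using System.IO;

class TruncateDemo
{
    static void Main()
    {
        string path = "test.dat"; // illustrative filename
        File.WriteAllBytes(path, new byte[] { 1, 2, 0, 3, 0, 0, 0 });

        // Scan backwards for the last non-zero byte.
        byte[] data = File.ReadAllBytes(path);
        int last = data.Length - 1;
        while (last >= 0 && data[last] == 0)
            --last;

        // Truncate the file just past that byte.
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Write))
            fs.SetLength(last + 1);

        Console.WriteLine(File.ReadAllBytes(path).Length); // 4 for this sample
    }
}
```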

Jon Skeet
10

@Factor Mystic,

I think there is a shorter way:

var data = new byte[] { 0x01, 0x02, 0x00, 0x03, 0x04, 0x00, 0x00, 0x00, 0x00 };
var new_data = data.TakeWhile((v, index) => data.Skip(index).Any(w => w != 0x00)).ToArray();
Brian J Cardiff
  • Interesting. Does anyone have any benchmarks to see how this compares to the 'raw' method? This isn't something I'd use LINQ for though. – Liam Dawson Nov 09 '11 at 12:10
  • Just tested this against @Coderer's solution, it's about 9 times slower – KVM Jan 05 '14 at 17:44
4

How about this:

[Test]
public void Test()
{
   var chars = new [] {'a', 'b', '\0', 'c', '\0', '\0'};

   File.WriteAllBytes("test.dat", Encoding.ASCII.GetBytes(chars));

   var content = File.ReadAllText("test.dat");

   Assert.AreEqual(6, content.Length); // includes the null bytes at the end

   content = content.Trim('\0');

   Assert.AreEqual(4, content.Length); // no more null bytes at the end
                                       // but still has the one in the middle
}
Rob
  • Treating it as text seems risky - plus you've just trebled the File IO. – Marc Gravell Oct 27 '08 at 15:37
  • Oh, and increased CPU etc significantly too (it takes time to do encoding/decoding, even for ASCII) – Marc Gravell Oct 27 '08 at 15:43
  • The encoding was just for the test... to write the sample file. Treating the file as text certainly might be an issue though. – Rob Oct 27 '08 at 15:49
  • That's not even a _byte_ array. It's a _char_ array. You realize you can just make a string out of that and trim the null characters off that without any file writes, right? `Char[] trimmed = new String(chars).Trim('\0').ToCharArray();` And that encoding messes up characters with a value greater than 0x80, so the size might not even match anyway. – Nyerguds Mar 15 '16 at 13:21
2

Assuming 0 = null, that is probably your best bet... as a minor tweak, you might want to use Buffer.BlockCopy when you finally copy the useful data.
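
A sketch of that tweak, pairing the backward scan with Buffer.BlockCopy (the method name is illustrative):

```csharp
using System;

class BlockCopyDemo
{
    static byte[] TrimTrailingNulls(byte[] input)
    {
        // Scan backwards to the last non-zero byte.
        int i = input.Length - 1;
        while (i >= 0 && input[i] == 0)
            --i;

        // Buffer.BlockCopy copies raw bytes; for byte arrays the difference
        // from Array.Copy is usually small, but it cannot be slower.
        var result = new byte[i + 1];
        Buffer.BlockCopy(input, 0, result, 0, i + 1);
        return result;
    }

    static void Main()
    {
        var trimmed = TrimTrailingNulls(new byte[] { 1, 0, 2, 0, 0 });
        Console.WriteLine(trimmed.Length); // 3
    }
}
```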

Marc Gravell
1

Test this:

    private byte[] trimByte(byte[] input)
    {
        int byteCounter = input.Length - 1;
        // guard against reading index -1 when the array is all zeros
        while (byteCounter >= 0 && input[byteCounter] == 0x00)
        {
            byteCounter--;
        }
        byte[] rv = new byte[byteCounter + 1];
        for (int byteCounter1 = 0; byteCounter1 <= byteCounter; byteCounter1++)
        {
            rv[byteCounter1] = input[byteCounter1];
        }
        return rv;
    }
A.Yaqin
  • Well, there's Array.Copy() for such bulk copy operations, so that may be more efficient, but for the rest you got the right idea. – Nyerguds Mar 15 '16 at 13:24
0

There is always a LINQ answer:

byte[] data = new byte[] { 0x01, 0x02, 0x00, 0x03, 0x04, 0x00, 0x00, 0x00, 0x00 };
bool data_found = false;
byte[] new_data = data.Reverse().SkipWhile(point =>
{
  if (data_found) return false;
  if (point == 0x00) return true; else { data_found = true; return false; }
}).Reverse().ToArray();
Factor Mystic
  • I've posted a shorter LINQ alternative in a separate answer. Hope you all like it. – Brian J Cardiff Oct 27 '08 at 18:20
  • If this is a big buffer, then it would be far more efficient to simply use the indexer backwards. Reverse() is a buffering operation, and has a performance cost. – Marc Gravell Oct 27 '08 at 22:15
0

You could just count the number of zeros at the end of the array and use that count instead of .Length when iterating the array later on. You can encapsulate this however you like. The main point is that you don't really need to copy the data into a new structure; if the arrays are big, avoiding the copy may be worth it.
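
One way to sketch that idea is to compute the effective length once and hand out a no-copy view, e.g. an ArraySegment&lt;byte&gt; (the helper name is illustrative):

```csharp
using System;

class EffectiveLengthDemo
{
    // Returns a view over the non-null prefix; no bytes are copied.
    static ArraySegment<byte> TrimView(byte[] data)
    {
        int len = data.Length;
        while (len > 0 && data[len - 1] == 0)
            --len;
        return new ArraySegment<byte>(data, 0, len);
    }

    static void Main()
    {
        var view = TrimView(new byte[] { 1, 2, 0, 3, 0, 0 });
        Console.WriteLine(view.Count); // 4
    }
}
```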

Greg Dean
0

If null bytes can be valid values within the file, do you know that the last byte of the file cannot be null? If so, iterating backwards and looking for the first non-null entry is probably best; if not, there is no way to tell where the actual end of the file is.

If you know more about the data format, such as that valid data can never contain a run of two or more consecutive null bytes (or some similar constraint), then you may be able to do a binary search for the 'transition point'. This should be much faster than the linear search (assuming that you can read in the whole file).

The basic idea (using the earlier assumption that valid data never contains two consecutive null bytes) would be:

var data = (byte array of file data...);
int index = data.Length / 2;
int jmpsize = data.Length / 2;
byte b1, b2;
while (true)
{
    jmpsize /= 2; // integer division
    if (jmpsize == 0) break;
    b1 = data[index];
    b2 = data[index + 1];
    if (b1 == 0 && b2 == 0) // too close to the end, go left
        index -= jmpsize;
    else
        index += jmpsize;
}

if (index == data.Length - 1) return data.Length;
b1 = data[index];
b2 = data[index + 1];
if (b2 == 0)
{
    if (b1 == 0) return index;
    else return index + 1;
}
else return index + 2;
luke
0

When the file is large (much larger than my RAM), I use this to remove trailing nulls:

static void RemoveTrailingNulls(string inputFilename, string outputFilename)
{
    int bufferSize = 100 * 1024 * 1024;
    long totalTrailingNulls = 0;
    byte[] emptyArray = new byte[bufferSize];

    using (var inputFile = File.OpenRead(inputFilename))
    using (var inputFileReversed = new ReverseStream(inputFile))
    {
        var buffer = new byte[bufferSize];

        while (true)
        {
            var start = DateTime.Now;

            var bytesRead = inputFileReversed.Read(buffer, 0, buffer.Length);

            if (bytesRead == emptyArray.Length && Enumerable.SequenceEqual(emptyArray, buffer))
            {
                totalTrailingNulls += buffer.Length;
            }
            else
            {
                var nulls = buffer.Take(bytesRead).TakeWhile(b => b == 0).Count();
                totalTrailingNulls += nulls;

                if (nulls < bytesRead)
                {
                    //found the last non-null byte
                    break;
                }
            }

            var duration = DateTime.Now - start;
            var mbPerSec = (bytesRead / (1024 * 1024D)) / duration.TotalSeconds;
            Console.WriteLine($"{mbPerSec:N2} MB/seconds");
        }

        var lastNonNull = inputFile.Length - totalTrailingNulls;

        using (var outputFile = File.Open(outputFilename, FileMode.Create, FileAccess.Write))
        {
            inputFile.Seek(0, SeekOrigin.Begin);
            inputFile.CopyTo(outputFile, lastNonNull, bufferSize);
        }
    }
}

It uses the ReverseStream class, which can be found here.

And this extension method:

public static class Extensions
{
    public static long CopyTo(this Stream input, Stream output, long count, int bufferSize)
    {
        byte[] buffer = new byte[bufferSize];
        long totalRead = 0;
        while (true)
        {
            if (count == 0) break;

            int read = input.Read(buffer, 0, (int)Math.Min(bufferSize, count));

            if (read == 0) break;
            totalRead += read;

            output.Write(buffer, 0, read);
            count -= read;
        }

        return totalRead;
    }
}
Fidel
-2

In my case the LINQ approach never finished ^))) It is too slow for working with byte arrays!

Guys, why don't you use the Array.Copy() method?

    /// <summary>
    /// Gets array of bytes from memory stream.
    /// </summary>
    /// <param name="stream">Memory stream.</param>
    public static byte[] GetAllBytes(this MemoryStream stream)
    {
        byte[] result = new byte[stream.Length];
        Array.Copy(stream.GetBuffer(), result, stream.Length);

        return result;
    }
Kirill
  • stream.GetArray() would be a better call to make in this instance as it does not return the whole memory buffer, only the data that has been written to the buffer. – Gusdor Dec 06 '11 at 12:18
  • ...that should be stream.ToArray(). My bad. Doesn't answer the question though. – Gusdor Dec 06 '11 at 12:27