I have a list of files that need to be read, in chunks, into a byte[], which is then passed to a hashing function. The tricky part is this: if I reach the end of a file, I need to continue reading from the next file until the buffer is full, like so:
read 16 bits as an example:
File 1: 00101010
File 2: 01101010111111111
would need to be read as 0010101001101010
The point is: these files can be as large as several gigabytes, and I don't want to completely load them into memory. Loading pieces into a buffer of, like, 30 MB would be perfectly fine.
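To make the requirement concrete, here is a minimal sketch of reading fixed-size pieces across file boundaries while holding only one piece in memory at a time. The names PieceReader and ReadPieces are hypothetical and not part of any library; it assumes pieceLength is greater than zero and that the files do not change while they are being read.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class PieceReader
{
    // Yields pieces of exactly pieceLength bytes, continuing into the next
    // file when one ends. Only the very last piece may be shorter.
    public static IEnumerable<byte[]> ReadPieces(IEnumerable<string> files, int pieceLength)
    {
        byte[] buffer = new byte[pieceLength];
        int filled = 0; // bytes of the current piece filled so far

        foreach (string path in files)
        {
            using (FileStream stream = File.OpenRead(path))
            {
                int read;
                // Read returns 0 only at end of file; the loop then moves on
                // to the next file with the partially filled buffer intact.
                while ((read = stream.Read(buffer, filled, pieceLength - filled)) > 0)
                {
                    filled += read;
                    if (filled == pieceLength)
                    {
                        yield return buffer;
                        buffer = new byte[pieceLength]; // fresh buffer for the next piece
                        filled = 0;
                    }
                }
            }
        }

        // Whatever is left after the last file forms a shorter final piece.
        if (filled > 0)
        {
            byte[] last = new byte[filled];
            Array.Copy(buffer, last, filled);
            yield return last;
        }
    }
}
```

Because the method is an iterator, a caller can hash each piece as it arrives (for example in a foreach loop) without ever loading a whole file.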
I want to use threading, but would it be efficient to thread reading a file? I don't know if disk I/O is such a large bottleneck that this would be worth it. Would the hashing be sped up sufficiently if I only thread that part, and lock on the read of each chunk? It is important that the hashes get saved in the correct order.
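One possible shape for "thread only the hashing, keep the order" is sketched below: a single thread produces the pieces, each piece is hashed on the thread pool, and results are collected oldest-first so the output order matches the input order. OrderedHasher, HashInOrder, and maxInFlight are hypothetical names; the sketch assumes Task.Run is available (.NET 4.5 or later) and says nothing about whether this is actually faster than a single thread on a given disk.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Threading.Tasks;

static class OrderedHasher
{
    // Hashes pieces in parallel while the caller's thread enumerates them.
    // At most maxInFlight pieces are alive at once, and hashes are returned
    // in the same order as the pieces were produced.
    public static List<byte[]> HashInOrder(IEnumerable<byte[]> pieces, int maxInFlight)
    {
        var hashes = new List<byte[]>();
        var pending = new Queue<Task<byte[]>>();

        foreach (byte[] piece in pieces)
        {
            // Bound memory use: wait for the oldest hash before queuing more work.
            if (pending.Count == maxInFlight)
            {
                hashes.Add(pending.Dequeue().Result);
            }

            byte[] current = piece; // capture this piece for the lambda
            pending.Enqueue(Task.Run(() =>
            {
                // One SHA1 instance per task; instances are not thread-safe.
                using (SHA1 sha1 = SHA1.Create())
                {
                    return sha1.ComputeHash(current);
                }
            }));
        }

        // Drain the remaining tasks, still oldest-first.
        while (pending.Count > 0)
        {
            hashes.Add(pending.Dequeue().Result);
        }
        return hashes;
    }
}
```

The queue is what preserves ordering: tasks may finish in any order, but their results are only ever taken from the front.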
The second thing I need to do is generate the MD5 sum of each file as well. Is there any way to do this more efficiently than as a separate step?
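As an illustration of avoiding the separate MD5 pass, the sketch below feeds every block that is read into a running MD5 via HashAlgorithm.TransformBlock and hands the same bytes to a callback, so the file is read from disk only once. SinglePassMd5, ReadWithMd5, and the onBlock callback are hypothetical names chosen for this example.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class SinglePassMd5
{
    // Reads the file once. Each block is added to a running MD5 and also
    // passed to onBlock (buffer, validByteCount), e.g. for piece hashing.
    // Returns the file's MD5 as an uppercase hex string.
    public static string ReadWithMd5(string path, int bufferSize, Action<byte[], int> onBlock)
    {
        using (MD5 md5 = MD5.Create())
        using (FileStream stream = File.OpenRead(path))
        {
            byte[] buffer = new byte[bufferSize];
            int read;
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Update the running MD5 with just the bytes that were read.
                md5.TransformBlock(buffer, 0, read, null, 0);
                onBlock(buffer, read);
            }
            // Finish the hash; no further input is supplied here.
            md5.TransformFinalBlock(buffer, 0, 0);
            return BitConverter.ToString(md5.Hash).Replace("-", "");
        }
    }
}
```

Note that the buffer is reused between calls, so a callback that needs to keep the bytes would have to copy them.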
(This question has some overlap with "Is there a built-in way to handle multiple files as one stream?", but I thought this differed enough.)
I am really stumped as to which approach to take, as I am fairly new to C#, as well as to threading. I already tried the approaches listed below, but they do not suffice for me.
As I am new to C# I value every kind of input on any aspect of my code.
This piece of code was threaded, but does not 'append' the streams, and as such generates invalid hashes:
public void DoHashing()
{
    ParallelOptions options = new ParallelOptions();
    options.MaxDegreeOfParallelism = numThreads;
    options.CancellationToken = cancelToken.Token;
    Parallel.ForEach(files, options, (string f, ParallelLoopState loopState) =>
    {
        options.CancellationToken.ThrowIfCancellationRequested();
        using (BufferedStream fileStream = new BufferedStream(File.OpenRead(f), bufferSize))
        {
            // Get the MD5sum first:
            using (MD5CryptoServiceProvider md5 = new MD5CryptoServiceProvider())
            {
                md5.Initialize();
                md5Sums[f] = BitConverter.ToString(md5.ComputeHash(fileStream)).Replace("-", "");
            }
            // ComputeHash read the stream to the end, so rewind before reading the pieces
            fileStream.Position = 0;
            //setup for reading:
            byte[] buffer = new byte[(int)pieceLength];
            //I don't know if the BufferedStream will mess up the file length
            long remaining = (new FileInfo(f)).Length;
            int done = 0;
            while (remaining > 0)
            {
                while (done < pieceLength)
                {
                    options.CancellationToken.ThrowIfCancellationRequested();
                    //either try to read the piecelength, or the remaining length of the file.
                    int toRead = (int)Math.Min(pieceLength - done, remaining);
                    int read = fileStream.Read(buffer, done, toRead);
                    //if read == 0, EOF reached
                    if (read == 0)
                    {
                        remaining = 0;
                        break;
                    }
                    //offsets
                    done += read;
                    remaining -= read;
                }
                // Hash the piece (only the bytes actually read, so a short last piece is not zero-padded)
                using (SHA1CryptoServiceProvider sha1 = new SHA1CryptoServiceProvider())
                {
                    sha1.Initialize();
                    byte[] hash = sha1.ComputeHash(buffer, 0, done);
                    hashes[f].AddRange(hash);
                }
                done = 0;
                buffer = new byte[(int)pieceLength];
            }
        }
    }
    );
}
This other piece of code isn't threaded (and doesn't calculate MD5):
void Hash()
{
    //examples, these got handled by other methods
    List<string> files = new List<string>();
    files.Add("a.txt");
    files.Add("b.blob");
    //....
    long totalFileLength = 0;
    int pieceLength = 1 << 20; // 2^20 bytes
    foreach (string file in files)
    {
        totalFileLength += (new FileInfo(file)).Length;
    }
    //Reading the file:
    long remaining = totalFileLength;
    byte[] buffer = new byte[(int)Math.Min(remaining, pieceLength)];
    int index = 0;
    FileStream fin = File.OpenRead(files[index]);
    int done = 0;
    while (remaining > 0)
    {
        while (done < buffer.Length)
        {
            //try to fill the rest of the current piece
            int toRead = buffer.Length - done;
            int read = fin.Read(buffer, done, toRead);
            //if read == 0, EOF reached
            if (read == 0)
            {
                fin.Dispose();
                index++;
                //if last file:
                if (index >= files.Count)
                {
                    remaining = 0;
                    break;
                }
                //get ready for next round:
                fin = File.OpenRead(files[index]);
                continue;
            }
            done += read;
            remaining -= read;
        }
        //Doing the piece hash:
        HashPiece(buffer);
        //reset for next piece:
        done = 0;
        buffer = new byte[(int)Math.Min(remaining, pieceLength)];
    }
    fin.Dispose();
}
void HashPiece(byte[] piece)
{
    using (SHA1CryptoServiceProvider sha1 = new SHA1CryptoServiceProvider())
    {
        sha1.Initialize();
        //hashes is a List
        hashes.Add(sha1.ComputeHash(piece));
    }
}
Thank you very much for your time and effort.
I'm not looking for completely coded solutions; any pointers or ideas on where to go with this would be excellent.
Questions & remarks to yodaj007's answer:
Why if (currentChunk.Length >= Constants.CHUNK_SIZE_IN_BYTES)? Why not ==? If the chunk is larger than the chunk size, my SHA1 hash gets a different value.
currentChunk.Sources.Add(new ChunkSource()
{
    Filename = fi.FullName,
    StartPosition = 0,
    Length = (int)Math.Min(fi.Length, (long)needed)
});
This is a really interesting idea: postpone reading until you need it. Nice!
chunks.Add(currentChunk = new Chunk());
Why do this both in the if (currentChunk != null) block and in the for (int i = 0; i < (fi.Length - offset) / Constants.CHUNK_SIZE_IN_BYTES; i++) block? Isn't the first a bit redundant?