I have about 2,600 massive XML files (~1 GB each when decompressed) which are currently gzipped rather densely and stored on my SSD. Each file contains between 23,000 and 30,000 records.
I need to scrape a comparatively small amount of data out of each record and persist it to a database.
I've estimated (with some basic tests) that the scraping alone will take at least 150 hours (I assume the persistence will be pretty quick by comparison, because it's so much less data).
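To give a sense of scale, the handful of values I actually need per record would fit in a tiny POCO along these lines (the class and property names below are just placeholders, not my real schema):

public class CompoundSummary
{
    // Placeholder fields -- the real set is similarly small: just a few
    // scalar values pulled out of each multi-kilobyte record.
    public int Cid { get; set; }
    public string InChIKey { get; set; }
    public double MolecularWeight { get; set; }
}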
I'm not terribly familiar with .NET's I/O APIs or how to make them more efficient, so here are the methods I'm currently using to test:
public PCCompounds DoStuff(String file)
{
    using (FileStream fs = this.LoadFile(file))
    {
        using (GZipStream gz = this.Unzip(fs))
        {
            using (XmlReader xml = this.OpenFile(gz))
            {
                return (PCCompounds)this.ParseXMLEntity(xml);
            }
        }
    }
}

private FileStream LoadFile(String file)
{
    return new FileStream(file, FileMode.Open);
}

private GZipStream Unzip(FileStream file)
{
    return new GZipStream(file, CompressionMode.Decompress);
}

private XmlReader OpenFile(GZipStream file)
{
    return XmlReader.Create(file);
}

private Object ParseXMLEntity(XmlReader xml)
{
    XmlSerializer serializer = new XmlSerializer(typeof(PCCompounds));
    return serializer.Deserialize(xml);
}
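One variation I've been considering (sketched below, not benchmarked) is opening the file read-only and putting a BufferedStream between the FileStream and the GZipStream; the 1 MB buffer size is an arbitrary guess on my part:

public PCCompounds DoStuffBuffered(String file)
{
    // Same pipeline as DoStuff, but with read-only file access and a BufferedStream
    // in front of the GZipStream. The 1 MB (1 << 20) buffer size is an arbitrary
    // guess, not a measured value.
    using (FileStream fs = new FileStream(file, FileMode.Open, FileAccess.Read, FileShare.Read))
    using (BufferedStream buffered = new BufferedStream(fs, 1 << 20))
    using (GZipStream gz = new GZipStream(buffered, CompressionMode.Decompress))
    using (XmlReader xml = XmlReader.Create(gz))
    {
        XmlSerializer serializer = new XmlSerializer(typeof(PCCompounds));
        return (PCCompounds)serializer.Deserialize(xml);
    }
}

I have no idea whether that buffering actually buys anything here, which is part of why I'm asking.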
Unfortunately, I have only found this on Stack Overflow, and most of those answers were somewhat incomplete. I've also been through Sasha Goldstein's .NET performance book, but his section on disk I/O is a little thin.
Any suggestions would be greatly appreciated.