
I have about 2,600 massive XML files (~1 GB each when decompressed), currently gzipped rather densely and stored on my SSD. Each file contains between 23,000 and 30,000 records.

I need to scrape these records for a comparatively small amount of data for each record and persist that data to a db.

I've estimated (with some basic tests) that this will take at least 150 hours to do the scraping (I assume the persistence will be pretty quick because it's so much less data).

I'm not terribly familiar with .NET's IO methods and how to make them more efficient, so here are the methods I'm currently using to test:

    // Requires: System.IO, System.IO.Compression, System.Xml, System.Xml.Serialization

    public PCCompounds DoStuff(String file)
    {
        using (FileStream fs = this.LoadFile(file))
        using (GZipStream gz = this.Unzip(fs))
        using (XmlReader xml = this.OpenFile(gz))
        {
            return (PCCompounds)this.ParseXMLEntity(xml);
        }
    }

    private FileStream LoadFile(String file)
    {
        return new FileStream(file, FileMode.Open);
    }

    private GZipStream Unzip(FileStream file)
    {
        return new GZipStream(file, CompressionMode.Decompress);
    }

    private XmlReader OpenFile(GZipStream file)
    {
        return XmlReader.Create(file);
    }

    private Object ParseXMLEntity(XmlReader xml)
    {
        // Deserializes the entire document (~1 GB decompressed) into one
        // object graph in memory.
        XmlSerializer serializer = new XmlSerializer(typeof(PCCompounds));
        return serializer.Deserialize(xml);
    }

Unfortunately, I have only found this on Stack Overflow, and most of those answers were somewhat incomplete. I've also been through Sasha Goldstein's .NET performance book, but his section on disk IO is a little thin.

Any suggestions would be greatly appreciated.

dansan
  • "Deserialization" of XML is usually called parsing. FWIW, I think a parsing speed of 1Gb/minute is realistically achievable, and anything faster than that is going to be challenging. – Michael Kay Aug 19 '13 at 16:24

1 Answer


I need to scrape these records for a comparatively small amount of data for each record and persist that data to a db.

Then I suggest you look at XmlReader. The API is very fiddly and more than a little awkward, and it will take you a bit of messing and debugging to get it reading right, but it will avoid a lot of issues; in particular:

  • you can skip sub-trees when you know you're not interested in them (see the sketch after this list)
  • you aren't instantiating objects you don't need
  • etc
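
For example, here is a minimal sketch of that reader-only scan. The `PC-Compound` and `PC-Compound_cid` element names are placeholders (not taken from your schema), and it assumes records are not nested:

    using System;
    using System.IO;
    using System.IO.Compression;
    using System.Xml;

    static class CompoundScanner
    {
        // Streams one gzipped file, visiting each record in turn without
        // ever holding the whole document in memory.
        public static void Scan(string path)
        {
            using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
            using (var gz = new GZipStream(fs, CompressionMode.Decompress))
            using (var xml = XmlReader.Create(gz))
            {
                // ReadToFollowing matches the qualified element name, so if
                // the document uses a namespace prefix, include it here.
                while (xml.ReadToFollowing("PC-Compound"))
                {
                    // Pull out just the one child we care about; the rest of
                    // this record is skipped by the next ReadToFollowing call.
                    if (xml.ReadToDescendant("PC-Compound_cid"))
                    {
                        string cid = xml.ReadElementContentAsString();
                        Console.WriteLine(cid);
                    }
                }
            }
        }
    }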

Of course, for the bits you are interested in, if the record is non-trivial, you might want to create a sub-tree reader (an XmlReader scoped to a particular node in a parent XmlReader) and feed that to XmlSerializer, offloading the complex work to XmlSerializer. You then just do "next, next, next; decide-to-skip; next; decide-to-deserialize-via-sub-tree", and so on, as sketched below.
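
Here is a minimal sketch of that hybrid, under the same placeholder element names, with a hypothetical cut-down `CompoundSummary` DTO standing in for just the fields you keep:

    using System;
    using System.IO;
    using System.IO.Compression;
    using System.Xml;
    using System.Xml.Serialization;

    // Illustrative DTO; map only the fields you actually need. If the
    // document declares a namespace, set it via the attributes'
    // Namespace property.
    [XmlRoot("PC-Compound")]
    public class CompoundSummary
    {
        [XmlElement("PC-Compound_cid")]
        public string Cid { get; set; }
    }

    static class SubtreeScanner
    {
        public static void Scan(string path)
        {
            // Build the serializer once, outside the loop; constructing an
            // XmlSerializer is expensive (it generates code on first use).
            var serializer = new XmlSerializer(typeof(CompoundSummary));

            using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
            using (var gz = new GZipStream(fs, CompressionMode.Decompress))
            using (var xml = XmlReader.Create(gz))
            {
                while (xml.ReadToFollowing("PC-Compound"))
                {
                    // ReadSubtree() scopes a child reader to this one record,
                    // so XmlSerializer never sees the rest of the document;
                    // disposing it advances the parent past the record.
                    using (XmlReader sub = xml.ReadSubtree())
                    {
                        var record = (CompoundSummary)serializer.Deserialize(sub);
                        // persist record.Cid to the db here
                    }
                }
            }
        }
    }

Either way the file is read forward-only, so memory use stays flat no matter how large each file is.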

Ultimately, though, you're going to need to chew through all that IO, which will take some time. Personally, I'd raise a little flag that maybe, just maybe, using XML isn't the best route going forward. Yes, it is what you have right now, but perhaps consider starting a project to change the future output to something with less overhead.

Marc Gravell
  • Thanks. I am using XmlReader (as the code above shows), but I'm clearly using it naively. I'll take a look at this sub-tree reader concept you're suggesting. This data comes from an external source, though, so although not ideal, it's better than learning a completely new text-delimited format specific to chemistry. – dansan Aug 19 '13 at 12:56
  • @dansan no, I'm saying "use `XmlReader` until you know you want the data", perhaps even reading all of it with `XmlReader` (and not using `XmlSerializer` *at all* unless the data is complex). Creating an `XmlReader` and passing the entire thing to `XmlSerializer` is not what I meant. – Marc Gravell Aug 19 '13 at 13:00
  • Oh, I see. I misunderstood. Appreciate the clarification. – dansan Aug 19 '13 at 13:06