I would like to parse an XML file that uses the following schema, extract the data in the two elements "adif" and "name", and place them in a Dictionary. I have no idea how to go about this using the built-in .NET classes or HTML Agility Pack.

Can someone please send me in the right direction? Thanks

<?xml version="1.0" encoding="utf-16"?>
<xs:schema xmlns="http://www.clublog.org/cty/v1.0" attributeFormDefault="unqualified" elementFormDefault="qualified" targetNamespace="http://www.clublog.org/cty/v1.1" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="clublog">
    <xs:complexType>
      <xs:sequence>

        <xs:element name="entities">
          <xs:complexType>
            <xs:sequence>
              <xs:element maxOccurs="unbounded" name="entity">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="adif" type="xs:decimal" />
                    <xs:element name="name" type="xs:string" />
                    <xs:element name="prefix" type="xs:string" />
                    <xs:element name="deleted" type="xs:boolean" />
                    <xs:element name="cqz" type="xs:unsignedByte" />
                    <xs:element name="cont" type="xs:string" />
                    <xs:element name="long" type="xs:decimal" />
                    <xs:element name="lat" type="xs:decimal" />
                    <xs:element minOccurs="0" name="start" type="xs:dateTime" />
                    <xs:element minOccurs="0" name="end" type="xs:dateTime" />
                    <xs:element minOccurs="0" name="whitelist" type="xs:boolean" />
                    <xs:element minOccurs="0" name="whitelist_start" type="xs:dateTime" />
                    <xs:element minOccurs="0" name="whitelist_end" type="xs:dateTime" />
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>

        <xs:element name="exceptions">
          <xs:complexType>

I am not interested in anything other than the entities node. There are at most about 400 of these, whereas the exceptions number in the tens of thousands. The code that I have so far is:

using (WebClient wc = new WebClient())
{
    // API holds the Club Log API key
    wc.DownloadFile("https://secure.clublog.org/cty.php?api=" + API, "Test.gz");

    var doc = new HtmlAgilityPack.HtmlDocument();

    using (var file = File.Open("Test.gz", FileMode.Open))
    using (var zip = new GZipStream(file, CompressionMode.Decompress))
    {
        doc.Load(zip);
    }

    Dictionary<string, string> dict = new Dictionary<string, string>();
}

And that's it. Of course, HTML Agility Pack has no documentation, and my understanding of parsing XML is limited.

This is where I am at: xd contains valid XML data.

    private void button1_Click(object sender, EventArgs e)
    {
        var dict = (Dictionary<string, decimal>)null;
        using (WebClient wc = new WebClient())
        {
            wc.DownloadFile("https://secure.clublog.org/cty.php?api=", "Test.gz");

            using (var file = File.Open("Test.gz", FileMode.Open))
            using (var zip = new GZipStream(file, CompressionMode.Decompress))
            using (var xmlReader = XmlReader.Create(zip))
            {
                // Dictionary<string, decimal> dict = new Dictionary<string, decimal>();
                var xd = XDocument.Load(xmlReader);
            }
        }
    }

So here is the XML data: two records. I tried to save the file on my server, but it would not let me.

<?xml version="1.0" encoding="utf-8" ?>
<clublog xmlns="http://www.clublog.org/cty/v1.0" date="2014-03-16T08:30:03+00:00">
  <entities>
    <entity>
      <adif>1</adif>
      <name>CANADA</name>
      <prefix>VE</prefix>
      <deleted>FALSE</deleted>
      <cqz>5</cqz>
      <cont>NA</cont>
      <long>-80.00</long>
      <lat>45.00</lat>
    </entity>
    <entity>
      <adif>2</adif>
      <name>ABU AIL IS</name>
      <prefix>A1</prefix>
      <deleted>TRUE</deleted>
      <cqz>21</cqz>
      <cont>AS</cont>
      <long>45.00</long>
      <lat>12.80</lat>
      <end>1991-03-30T23:59:59+00:00</end>

Tom

2 Answers

Jeenkies. I just wrote a nice answer to another issue just like this. If you can use .NET 3.5, you can use LINQ to XML, which will make this tremendously easier.

Let's get started. First you need to load your document. Look here and here for some help with that. The second one I think will help you more though.

Now comes the digging. Since you are interested in nodes that may be only a few layers deep, this shouldn't be too painful. At this point there are two designs (that I can think of): chipping away layer by layer, and blasting the document into tiny pieces. Since you are dealing with a rather large amount of data, chipping might be faster, or it might not. So I will include both designs and let you test them from there.

Both designs assume that doc represents the entire XML document.

Chipping method:

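XNamespace xs = "http://www.w3.org/2001/XMLSchema"; // a prefix such as xs: must be written as a namespace in C#
var elements = doc.Elements(xs + "element").Where(el => (string)el.Attribute("name") == "entities");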

From there it should be a simple matter of using combinations of Elements() and Attributes().

Blasting method just replaces Elements() with Descendants(). If you are dealing with near-root level nodes, I'd just stick to the chipping method.
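
To make the difference concrete, here is a minimal sketch of both styles run against the data file rather than the schema. The inline sample is a hypothetical, trimmed-down copy of the two records from the question, and the class name is only for illustration. Note that every element name has to be combined with the document's default namespace.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

class ChippingVsBlasting
{
    static void Main()
    {
        // Hypothetical sample shaped like the question's data, trimmed to two fields per entity.
        const string xml =
            "<clublog xmlns=\"http://www.clublog.org/cty/v1.0\">" +
            "<entities>" +
            "<entity><adif>1</adif><name>CANADA</name></entity>" +
            "<entity><adif>2</adif><name>ABU AIL IS</name></entity>" +
            "</entities>" +
            "</clublog>";

        XDocument doc = XDocument.Parse(xml);
        XNamespace ns = "http://www.clublog.org/cty/v1.0";

        // Chipping: walk down one level at a time with Element()/Elements().
        var chipped = doc.Root.Element(ns + "entities").Elements(ns + "entity");

        // Blasting: search the whole tree in one go with Descendants().
        var blasted = doc.Descendants(ns + "entity");

        Console.WriteLine(chipped.Count());
        Console.WriteLine(blasted.Count());
    }
}
```

Both queries should find the same two entity elements here. On the real file, Descendants() also has to walk through the tens of thousands of exception nodes, which is one reason chipping may be the better fit.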

Now comes putting it into a Dictionary. This should point you in the right direction. It sure came in handy for me.

bubbinator

Something like this should work for you:

// Needs: System.Collections.Generic, System.IO, System.IO.Compression,
//        System.Linq, System.Net, System.Xml, System.Xml.Linq
XNamespace ns = "http://www.clublog.org/cty/v1.0";
var dict = (Dictionary<string, decimal>)null;
using (WebClient wc = new WebClient())
{
    // Download raw bytes: gzip data is binary, so reading it as a string
    // and re-encoding it would corrupt the stream.
    var bytes = wc.DownloadData(
        "https://secure.clublog.org/cty.php?api=" + API);
    using (var stream = new MemoryStream(bytes))
    {
        using (var zip = new GZipStream(stream, CompressionMode.Decompress))
        {
            using (var xmlReader = XmlReader.Create(zip))
            {
                // Load() reads the whole document from a fresh reader.
                var xd = XDocument.Load(xmlReader);
                dict =
                xd
                    .Root
                    .Element(ns + "entities")
                    .Elements(ns + "entity")
                    .ToDictionary(
                        x => x.Element(ns + "name").Value,
                        x => (decimal)x.Element(ns + "adif"));
            }
        }
    }
}

I made the assumption that you actually wanted a Dictionary<string, decimal> given the type of "adif", but if I'm wrong it should be easy to change.
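
If a Dictionary<string, string> is what is wanted, as in the question, the change might look like the sketch below. It parses a hypothetical inline sample (a trimmed copy of the two records posted in the question) instead of downloading anything, keys the dictionary by "adif" purely as an illustration, and also shows why leaving out the namespace ends in a null reference.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

class EntityDictionaryDemo
{
    static void Main()
    {
        // Hypothetical sample shaped like the question's data.
        const string xml =
            "<clublog xmlns=\"http://www.clublog.org/cty/v1.0\">" +
            "<entities>" +
            "<entity><adif>1</adif><name>CANADA</name></entity>" +
            "<entity><adif>2</adif><name>ABU AIL IS</name></entity>" +
            "</entities>" +
            "</clublog>";

        XNamespace ns = "http://www.clublog.org/cty/v1.0";
        XElement root = XDocument.Parse(xml).Root;

        // Without the namespace the lookup matches nothing and returns null;
        // chaining another call onto it is what throws NullReferenceException.
        Console.WriteLine(root.Element("entities") == null); // True

        // With the namespace, read both fields as strings.
        Dictionary<string, string> dict = root
            .Element(ns + "entities")
            .Elements(ns + "entity")
            .ToDictionary(
                e => (string)e.Element(ns + "adif"),
                e => (string)e.Element(ns + "name"));

        foreach (var pair in dict)
            Console.WriteLine(pair.Key + " = " + pair.Value);
    }
}
```

ToDictionary throws on duplicate keys, so whichever field is chosen as the key needs to be unique across entities.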

My approach avoids all the mucking around with files.

Enigmativity
  • Hi, I tried your code and at the line using (var xmlReader = XmlReader.Create(zip)) I receive an InvalidDataException: "The magic number in GZip header is not correct". If I try to use my file based code then I receive an error about a magic number. – Tom Mar 16 '14 at 14:28
  • Oops, sorry. If I use my file based code then I receive an error that the XmlReader must be interactive? Thanks – Tom Mar 16 '14 at 14:56
  • @Tom I think the trouble that you are having with both of these solutions lies in the xml data being UTF-16. As far as I can tell, you must convert it, UTF-8 is ideal. You can read [this post](http://stackoverflow.com/a/1033807/3317555) for some more information. – bubbinator Mar 16 '14 at 18:03
  • Hi, perhaps, but my code that uses a file works fine. If I replace var xd = XDocument.ReadFrom(xmlReader) with var xd = XDocument.Load(xmlReader) I get rid of the error that the XmlReader must be interactive. At that point I have xd, which contains a perfectly valid uncompressed XML document. However, at that point, the above code does not work. It tells me there is a null reference in the dict = xd.... – Tom Mar 16 '14 at 18:39
  • @Tom - Try changing the memory stream encoding to "UTF16". – Enigmativity Mar 16 '14 at 20:24
  • Hi, nope, that is not an option for the memory stream. Only UTF7 and UTF32, neither of which works. If I use file IO and use var xd = XDocument.Load(zip) then I have valid XML in xd. However, then as I step into dict = xd...... I get a null reference exception. – Tom Mar 16 '14 at 20:50
  • @Tom - You need to post some of your XML for me to be able to debug the issue. – Enigmativity Mar 16 '14 at 22:54
  • @Tom - Also, "Unicode" is the "UTF16" format. Try that one. – Enigmativity Mar 16 '14 at 22:56
  • Hi, actually the issue is not the file at all. I can write the decompressed gzip file to a text file and load it in without any issues. My real issue is getting the adif and name into the dictionary. I am going to try and put the XML in for two records... – Tom Mar 17 '14 at 03:15
  • @Tom - There you go. I changed my answer to include the namespace. It works for me now. – Enigmativity Mar 17 '14 at 03:26
  • @Enigmativity It works! Not with the memory stream but with the file which I think is preferable in my case (download once a month). You have no idea how much time I spent! I was about to do the old file IO with strings etc, lol. Nonetheless, in the process I did learn a lot about XML last night and was going to try and tackle it with a fresh start today. Once again, thanks for making my day! – Tom Mar 17 '14 at 16:10
  • @Tom - there is a tick just to the left of the question. Click on the tick - it should turn from white to green. – Enigmativity Mar 17 '14 at 23:55