1

I'm trying to do a bit of scraping in a C# application.

I am trying to access 4 pieces of information on the following page: https://smstestbed.nist.gov/vds/current

  • CreationTime
  • Availability
  • Linear X and Y coords

The function below polls a live data feed from a remote machining tool. I have been able to print 'CreationTime' to a terminal, but my XPath use is horrifically clunky. As far as this link suggests, I should be able to do what I am attempting in the two lines after this comment:

"//This should be a far better way of accessing the data but for some reason the second line fails"

Unfortunately, AvailabilityNode comes back null.

public static void PollNIST()
    {
        string NISTSourceURL = "https://smstestbed.nist.gov/vds/current";  // Gives us a human friendly reference to the HTM
        //-------------------------------- Current (mostly) Working Version---------------------------------------------------------------------------------
        // Retrieve raw HTML
        var NISTTargetURL = NISTSourceURL;
        var NISTHttpClient = new HttpClient();
        var NISTXMLRaw = NISTHttpClient.GetStringAsync(NISTTargetURL);  // We now have all of the HTML / XML Data as a raw string
                                                                        //Console.WriteLine(MazXMLRaw.Result);                   // Prints the resulting HTML to a terminal as a debug tool    (Works)   
        XmlDocument CurNISTXML = new XmlDocument();               // Generate Blank XML Doc
        CurNISTXML.LoadXml(NISTXMLRaw.Result);                     // This (".result") passes the actual string?, should then be loaded into new XML file

        var elementHeader = CurNISTXML.GetElementsByTagName("Header");
        var curNISTHeader = elementHeader.Item(0);
        var creationTime = curNISTHeader.Attributes[0];  // We actually have the creationTime            
        string CurNISTTime = creationTime.InnerText; //      //*[@id="mtconnect content"]/ul/li[1]

        //This should be a far better way of accessing the data but for some reason the second line fails
        XmlNode AvailabilityNode = CurNISTXML.SelectSingleNode("/table[1]/tbody/tr[1]");  //*[@id="mtconnect content"]/table[1]/tbody/tr[1]/td[7] // Xpath Availability
        var CurNISTStatus = AvailabilityNode.InnerText; //      //*[@id="mtconnect content"]/ul/li[1]


        string CurNistX = ""; //      //*[@id="mtconnect content"]/table[5]/tbody/tr/td[7]
        string CurNistY = ""; //      //*[@id="mtconnect content"]/table[6]/tbody/tr/td[7]

        Console.WriteLine("-------BEGIN NIST DATA PACKET-------");
        Console.WriteLine("NIST Time  : " + creationTime.InnerText);
        Console.WriteLine("NIST Status: " + CurNISTStatus);    
        Console.WriteLine("NIST X Pos.: " + CurNistX);
        Console.WriteLine("NIST Y Pos.: " + CurNistY);
        Console.WriteLine("--------END NIST DATA PACKET--------");

        //var currentNIST = new NISTDataSet()// Create new instance ofNISTdata object
    }

Any ideas?

chriga
GigaJoules
  • You are trying to parse an HTML webpage using XML. You are using the wrong URL. The data is available as XML, but you need to use a different URL. See: https://www.nist.gov/programs-projects/materials-data-curation-system – jdweng Nov 06 '18 at 10:51
  • Are you sure? If I print the XML doc to console it's all there, and creationtime works just fine. – GigaJoules Nov 06 '18 at 10:56
  • This is my first time writing C#, so I'm getting stuck with things that are probably quite simple – GigaJoules Nov 06 '18 at 11:07
  • What xml link are you using? What you posted is only html. – jdweng Nov 06 '18 at 11:25
  • The timestamp is gained only using the link given in the first line of the method – GigaJoules Nov 06 '18 at 11:27
  • When I view source it appears the link ending "/vds/current" is the path to the XML? – GigaJoules Nov 06 '18 at 11:31
  • The smstestbed has a schema location at the top of the xml file. Get the schema from location. Then use the msdn xsd.exe tool to convert xml to classes (option /cl /l:cs). Then use xml serialization to parse data. – jdweng Nov 06 '18 at 11:31
  • Go to URL with browser. An xml file starts with : – jdweng Nov 06 '18 at 12:34
  • Any idea how I can just directly address the XML? – GigaJoules Nov 13 '18 at 10:40
  • I think I've mixed up 'view source' and 'inspect'. When I hit view source I only see XML. – GigaJoules Nov 13 '18 at 10:44
  • When I do a Console.WriteLine(CurNistXML.InnerXML); i get something that starts with – GigaJoules Nov 13 '18 at 10:48
  • Does the xml contain tag "Header" [CurNISTXML.GetElementsByTagName("Header");]. I think the xml is embedded in Html. The Header tag is part of the HTML and doesn't exist in the Xml. – jdweng Nov 13 '18 at 11:53

2 Answers

1

The XPath expression

/table[1]/tbody/tr[1]

will succeed only if the outermost element of the document is a table element, which seems unlikely. I haven't tried to understand the logic of the page or of your code, but this definitely looks wrong. "/" at the start of a path expression selects from the root of the tree.
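To make the difference concrete, here is a minimal sketch run against a small made-up document rather than the NIST feed. The `page` wrapper element and the cell text are assumptions for illustration only; the calls used are the standard `XmlDocument.LoadXml` and `SelectSingleNode`.

```csharp
using System;
using System.Xml;

public static class XPathRootDemo
{
    public static void Main()
    {
        // Hypothetical sample document (not the real feed): the outermost
        // element is <page>, and the table sits inside it.
        var doc = new XmlDocument();
        doc.LoadXml("<page><table><tbody><tr><td>AVAILABLE</td></tr></tbody></table></page>");

        // A leading "/" starts at the root, so this only matches when the
        // outermost element itself is <table>. Here it is not, so the result is null.
        XmlNode fromRoot = doc.SelectSingleNode("/table[1]/tbody/tr[1]");
        Console.WriteLine(fromRoot == null);   // prints "True"

        // A leading "//" searches all descendants, so the nested table is found.
        XmlNode anywhere = doc.SelectSingleNode("//table[1]/tbody/tr[1]");
        Console.WriteLine(anywhere.InnerText); // prints "AVAILABLE"
    }
}
```

Calling `.InnerText` on the first result would throw a NullReferenceException, which is consistent with the null AvailabilityNode described in the question. Note that a document declaring a default namespace would additionally need an XmlNamespaceManager (or a namespace-agnostic expression) before element names in an XPath match.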

Michael Kay
  • Yeah, I thought that. I've tried several different things there, which is why I think that single slash is there – GigaJoules Nov 06 '18 at 12:39
  • @GigaJoules Does '//table[1]/tbody/tr[1]' select what you wanted? It is unclear to me which element you are trying to select. – Mate Mrše Nov 06 '18 at 14:11
  • @GigaJoules We see a lot of questions where people have scattered random punctuation around their XPath expressions in the hope that it will act as magic fairy dust. It's rarely an effective strategy. Save yourself time, read the manual. – Michael Kay Nov 06 '18 at 15:57
  • I'm looking to pull the word 'available' from the top right cell of the first table, and the 'value' number of tables 'linear x' and 'linear y' – GigaJoules Nov 07 '18 at 11:33
  • Going for the attribute ID was a far better option in the end, as each element has a unique identifier and only occurs once. – GigaJoules Feb 14 '19 at 10:20
0

So it turns out there was nothing wrong with how I was extracting the XML, only with my Paths.

public static void PollNIST()
        {
            string NISTSourceURL = "https://smstestbed.nist.gov/vds/current";  // Gives us a human friendly reference to the HTML
            // string NistXmlUrl = // Someone on stackexchange is claiming that there is another url for the XML but viewsource says otherwise 
            //-------------------------------- Current (mostly) Working Version---------------------------------------------------------------------------------
            var NISTHttpClient = new HttpClient();
            var NISTXMLRaw = NISTHttpClient.GetStringAsync(NISTSourceURL);  // We now have all of the HTML / XML Data as a raw string
                                                                            //Console.WriteLine(MazXMLRaw.Result);                   // Prints the resulting HTML to a terminal as a debug tool    (Works)   
            XmlDocument CurNISTXML = new XmlDocument();               // Generate Blank XML Doc
            CurNISTXML.LoadXml(NISTXMLRaw.Result);                     // This (".result") passes the actual string?, should then be loaded into new XML file

            // Get CreationTime (WORKING!)
            XmlNodeList elementHeader = CurNISTXML.GetElementsByTagName("Header");
            XmlNode curNISTHeader = elementHeader.Item(0);
            XmlAttribute creationTime = curNISTHeader.Attributes[0];  // We now have the creationTime element          
            string CurNISTTime = creationTime.InnerText;  //      //*[@id="mtconnect content"]/ul/li[1]

            // Get availability (WORKING!)
            XmlNodeList nodeAvailability = CurNISTXML.GetElementsByTagName("Availability");
            XmlNode availability = nodeAvailability.Item(0); // I think this is maybe a bit of a hackish / improper way to do this?
            string curNISTStatus = availability.InnerText;

            //Get linear tool X Coord.
            XmlNodeList deviceStream = CurNISTXML.GetElementsByTagName("ComponentStream");
            XmlNode linearCompXStream = deviceStream.Item(4);
            string curNISTX = linearCompXStream.InnerText; //  We do not need to break down the nodes any further as the value is the only text within

            //Get Linear tool y Coord.            
            XmlNode linearCompYStream = deviceStream.Item(5);
            string curNISTY = linearCompYStream.InnerText; //  We do not need to break down the nodes any further as the value is the only text within


            Console.WriteLine("-------BEGIN NIST DATA PACKET-------");
            Console.WriteLine("NIST Time  : " + creationTime.InnerText);
            Console.WriteLine("NIST Status: " + curNISTStatus);    
            Console.WriteLine("NIST X Pos.: " + curNISTX);
            Console.WriteLine("NIST Y Pos.: " + curNISTY);
            Console.WriteLine("--------END NIST DATA PACKET--------");

            //var currentNIST = new NISTDataSet()// Create new instance ofNISTdata object
        }

works nicely.
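The positional lookups above (`Attributes[0]`, `Item(4)`, `Item(5)`) depend on the order of attributes and streams in the feed. As a hedged sketch of the attribute-based lookup mentioned in the comments, the example below selects by name and by identifier instead. The sample XML, the namespace URI, the `dataItemId` attribute name, and the id values `avail`, `x_pos` and `y_pos` are all assumptions invented for illustration; the real identifiers have to be read from the actual feed.

```csharp
using System;
using System.Xml;

public static class AttributeLookupDemo
{
    // Hypothetical, cut-down stand-in for the feed; every id and value is invented.
    private const string SampleXml =
        "<MTConnectStreams xmlns='urn:example:streams'>" +
        "<Header creationTime='2018-11-06T10:51:00Z' sender='demo'/>" +
        "<Streams><DeviceStream>" +
        "<ComponentStream component='Device'><Events>" +
        "<Availability dataItemId='avail'>AVAILABLE</Availability>" +
        "</Events></ComponentStream>" +
        "<ComponentStream component='Linear' name='X'><Samples>" +
        "<Position dataItemId='x_pos'>12.5</Position>" +
        "</Samples></ComponentStream>" +
        "<ComponentStream component='Linear' name='Y'><Samples>" +
        "<Position dataItemId='y_pos'>-3.25</Position>" +
        "</Samples></ComponentStream>" +
        "</DeviceStream></Streams></MTConnectStreams>";

    public static void Main()
    {
        var doc = new XmlDocument();
        doc.LoadXml(SampleXml);

        // Read the attribute by name rather than by position.
        XmlNode header = doc.GetElementsByTagName("Header").Item(0);
        string time = header.Attributes["creationTime"].Value;

        // "//*[@attr='value']" matches any element carrying that attribute,
        // so the default namespace on the elements does not get in the way.
        string status = doc.SelectSingleNode("//*[@dataItemId='avail']").InnerText;
        string x = doc.SelectSingleNode("//*[@dataItemId='x_pos']").InnerText;
        string y = doc.SelectSingleNode("//*[@dataItemId='y_pos']").InnerText;

        Console.WriteLine("Time  : " + time);
        Console.WriteLine("Status: " + status);
        Console.WriteLine("X Pos.: " + x);
        Console.WriteLine("Y Pos.: " + y);
    }
}
```

Because each lookup names what it wants, adding or reordering streams in the feed should not silently shift which value is read, whereas an index such as `Item(4)` would.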

GigaJoules