Parsing a feed in C#

Question

I am having problems parsing a feed in C#.

I cannot get the authors of the feeds to change the code so I have to handle it.

I have tried passing the feed URL straight into the XmlDocument object, and also fetching it as text with WebClient, trimming it to remove any whitespace that seems to be put in front of it for some reason, and then loading it with the LoadXml method.

You can see an example of the feed here > http://scotjobsnet.co.uk.ni.strategiesuk.net/testfeed.xml

I cannot get past either loading it directly from the URL:

XmlDocument xmlDoc = new XmlDocument();
xmlDoc.Load(feedURL);

or loading it from a string:

XmlDocument xmlDoc = new XmlDocument();
string feedAsString = "";
// get from web as string
var webClient = new WebClient();

// Tell them who we are for white listing
webClient.Headers.Add("user-agent", "Mozilla/5.0 (compatible; Job Feed Importer;)");

// fetch feed as string
var content = webClient.OpenRead(feedURL);
var contentReader = new StreamReader(content);
var rssFeedAsString = contentReader.ReadToEnd();
rssFeedAsString = rssFeedAsString.Trim(); // remove any white space before the feed
xmlDoc.LoadXml(feedAsString);

The errors I get are:

Root element is missing.
Could not extract first items from feed string; Error The element with name 'jobs' and namespace '' is not an allowed feed format.

I want to use xpath /jobs/job/ to loop through the feed nodes.

I have parsed feeds like this before with XmlDocument passing in just a URL and if not then a string.

I am thinking of resorting to using regular expressions to loop through the feeds using a <job>[\s\S]+></job> type expression.

However I would rather use standard methods.

As I cannot get the feeds changed, can anyone tell me what is wrong with the feed and the way I am parsing it? Forgive the use of var; I just nicked a snippet of code to parse a feed from an example that was using it. I am using strong types everywhere else and will convert it once I get it working.

Any help would be much appreciated.

Thanks

Jon Skeet · Answer 1 · 2014-06-09T13:52:57.303

0

EDIT: The reason your current code is failing is pretty simple - you're trying to parse an empty string:

string feedAsString = "";
...
var rssFeedAsString = contentReader.ReadToEnd();
rssFeedAsString = rssFeedAsString.Trim();
xmlDoc.LoadXml(feedAsString);

You're never setting feedAsString to a new value - but you're fetching the text as rssFeedAsString. Those are two different variables.
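A minimal sketch of that fix, keeping the XmlDocument approach from the question: the downloaded text is held in a single variable and that same variable is passed to LoadXml. The names FeedLoader, ParseFeed and DownloadFeed, and the sample XML in Main, are hypothetical and only there for illustration.

```csharp
using System;
using System.Net;
using System.Xml;

static class FeedLoader
{
    // Parses feed text that has already been downloaded.
    public static XmlDocument ParseFeed(string feedAsString)
    {
        var xmlDoc = new XmlDocument();
        // Trim() mirrors the question's code; it removes whitespace around the document.
        xmlDoc.LoadXml(feedAsString.Trim());
        return xmlDoc;
    }

    // Downloads the feed and parses the same string that was downloaded.
    public static XmlDocument DownloadFeed(string feedUrl)
    {
        using (var webClient = new WebClient())
        {
            // User-Agent value taken from the question.
            webClient.Headers.Add("user-agent", "Mozilla/5.0 (compatible; Job Feed Importer;)");
            string feedAsString = webClient.DownloadString(feedUrl);
            return ParseFeed(feedAsString);
        }
    }

    static void Main()
    {
        // Hypothetical sample; the real feed has more elements per job.
        XmlDocument doc = ParseFeed("  <jobs><job><title>Tester</title></job></jobs>  ");
        foreach (XmlNode job in doc.SelectNodes("/jobs/job"))
        {
            Console.WriteLine(job.OuterXml);
        }
    }
}
```

Splitting the parsing from the download also makes the parsing step easy to check with a hard-coded string, which would have exposed the empty-string mistake quickly.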

That said, I'd use a different approach entirely. I don't see any need for trimming etc - or using XPath, or passing it through an RSS reader (given that it's not RSS). The only tricky part is explicitly specifying a User-Agent header, as otherwise the server rejects the request.

Personally I'd use LINQ to XML, which seems to be fine:

using System;
using System.Net;
using System.Xml.Linq;

class Test
{
    static void Main()
    {
        string text;
        using (var webClient = new WebClient())
        {
            string url = "http://scotjobsnet.co.uk.ni.strategiesuk.net/testfeed.xml";
            webClient.Headers.Add("user-agent", "Mozilla/5.0");
            text = webClient.DownloadString(url);
        }
        var doc = XDocument.Parse(text);
        foreach (var job in doc.Root.Elements("job"))
        {
            Console.WriteLine(job);
        }
    }
}

edited Jun 09 '14 at 13:52

answered Jun 09 '14 at 08:35

Jon Skeet


Hi but if you notice in the example I was setting a user-agent on the download of the XML as a string before passing it into the XmlDocument. What is the difference? – MonkeyMagix Jun 09 '14 at 11:56
@MonkeyMagix: Well it could be due to the character encoding - you're assuming it's in UTF-8, for example. Fundamentally I just used an easier approach to fetching the data. We could look into what's wrong in more detail, but I'd just stick with the simple, working code :) – Jon Skeet Jun 09 '14 at 11:58
Okay, so I'm used to parsing a feed with xpath and using foreach (XmlNode child in childNodes) and if(child.node.innerText == "jobtitle") - so what would be the equivalent using var job? I have the field names I need to check in an array so I need to check each node inside /job/jobs/ e.g jobTitle, jobRef, jobDesc - and save them. Also why is everyone moving to using var instead of strong types nowadays? – MonkeyMagix Jun 09 '14 at 12:26
@MonkeyMagix: What makes you think `var` isn't strongly typed? It's just a matter of *implicit* rather than *explicit* typing. It sounds like you should read a tutorial on LINQ to XML though, as a lot of your questions will be answered there. (Not sure what you mean about `jobRef` as that doesn't appear in the sample document - you should generally be using the element names, which is really easy in LINQ to XML.) – Jon Skeet Jun 09 '14 at 12:29
It's just I am writing a conversion project from Betfair SOAP to JSON and now everything, even strings, integers & other objects are being used with var. You cannot pass a var from method to method & I've read a lot of articles against it. I am not an expert I just wanted to know why people would use var WebClient = new WebClient() instead of WebClient WebClient = new WebClient. JobRef,JobTitle etc are all the Field Names in OUR DB. I have a mapping table which maps their node names in the XML to our correct field names. So I need to get the node in the XML but save the correct FieldName. – MonkeyMagix Jun 09 '14 at 12:47
"You cannot pass a var from method to method" - I think you've misunderstood what `var` is about. `var webClient = new WebClient();` is *exactly* equivalent to `WebClient webClient = new WebClient();`. Please read up on the features of C# 3 before deciding not to use them. I'm not sure how relevant the rest of your comment is, but I would definitely recommend using element names rather than `InnerText`. – Jon Skeet Jun 09 '14 at 12:50
I have read about var > http://stackoverflow.com/questions/41479/use-of-var-keyword-in-c-sharp is just one example with pros/cons Also why do people mix their code up with the use of var and other types e.g your string text; Why did you not use var text; there instead? Just curious to when to use it & not to. The problem is I have 2 feeds, 2 different node sets, 1 normalised DB. I have to map the nodes to the correct fields before saving them to the DB. So I need to know the correct Field Name to save which is NOT the same as the node name. – MonkeyMagix Jun 09 '14 at 13:00
So if you can imagine I have a DataTable that holds the "mapped node name" and the "DB Field Name". Then I need to check each node and its name in the loop. Pass that AND the value (innerText) or node.value in my method to get the correct FieldName. Which I can then save to the DB. Therefore if not using xpath I need an example of checking node names with or without var etc. I am hacking about at the moment - got past load errors but getting "no node" errors back at the min – MonkeyMagix Jun 09 '14 at 13:08
@MonkeyMagix: No, you don't need an example without using `var` - you need to understand what `var` means so that you can very easily read code which *does* use `var`. Don't get hung up on it - and learn about LINQ to XML too, at which point you'll very easily be able to find out an element name, and search *by* element name. Basically, I've pointed you in a useful direction - but I'm not going to spoonfeed you the code. – Jon Skeet Jun 09 '14 at 13:29
I'd love to spend the time reading up about LINQ which I have used before to create generic lists with filters (for Betfair) but I have to get this finished quickly. I wanted to just be able to use xpath like I have done on previous feed imports with no problem before so I don't understand why I can't with this feed. When I have the time I will read up on LINQ to XML. At the moment I just need to iterate through the feed job node by job node and then each individual node inside WITHOUT knowing the names of them - so checking child.node.name - so I can pass it into my mapping table etc. – MonkeyMagix Jun 09 '14 at 13:38
@MonkeyMagix: I'm sorry, but that sounds very much like "Please tell me exactly the code I need, because I don't want to take the time required to read one LINQ to XML tutorial." I'm not in the business of doing anyone else's job. Just browsing the API documentation would probably be enough: Hint: `Elements()` and `XName.LocalName`. If that's not enough, I'm afraid we're done. – Jon Skeet Jun 09 '14 at 13:41
Well I am just off to my second GP appointment for the day and I didn't want the code. I wanted to know WHY the code I had that works for OTHER feeds - in similar bad shape - DOESN'T work. So the inverse of being spoon fed, i.e. help with what I was doing wrong. I have xpath feed code working everywhere else so I just didn't understand why this feed wouldn't work with it. Anyway, have to go - thanks for your help. – MonkeyMagix Jun 09 '14 at 13:45
@MonkeyMagix: See my edit (at the start of the answer) for what's wrong in your current code. You should have been able to find that trivial error via debugging though, before ever posting on Stack Overflow. However, hopefully this will set you on a better path anyway, to cleaner code... – Jon Skeet Jun 09 '14 at 13:53
You say I am using UTF-8 and the feed is UTF-8 and contains UTF-8 characters so I doubt that is the problem. Not when I was getting "namespace '' is invalid" errors. – MonkeyMagix Jun 09 '14 at 15:14
Sorry you meant the example where I was copying and pasting and put the wrong variable in - the empty string - I guess that would have given a "no root document error". However I have it working now. Thanks – MonkeyMagix Jun 09 '14 at 15:31
@MonkeyMagix: We've no idea where you're getting the namespace is invalid error from, but I suspect that's different code again. But yes, the code you've given will yield the "Root element is missing" error. The first version of the code you've given will fail due to the server requiring a particular user agent. – Jon Skeet Jun 09 '14 at 16:10
Sorry can you explain what you mean when you say the first version will fail due to the server "requiring" a particular user-agent? I am just passing a user-agent to the 3rd party server telling them who I am. It is my own servers that I block blank user-agents on. The test page is on one of my own servers so as long as the agent isn't blank it should work but there is no code anywhere saying when extracting a feed the useragent MUST BE X or Y. Could you clarify. – MonkeyMagix Jun 10 '14 at 09:57
The code that brought me the "Error The element with name 'jobs' and namespace '' is not an allowed feed format." error was some code I was using that converted the feed to XML using LINQ which I found (I think on this site) for checking a feed was really XML etc. You can see the example code here > http://scotjobsnet.co.uk.ni.strategiesuk.net/FeedError.txt – MonkeyMagix Jun 10 '14 at 10:12
@MonkeyMagix: Your very first code snippet (just creating an `XmlDocument` and calling `xmlDoc.Load(feedURL)`) *won't* pass a specific user agent, so it will fail for that reason (you'll get a 403 response). The sample code you've given that talks about a feed failure is due to you trying to parse this as an RSS feed, when it's *not* an RSS feed at all. It's just an XML file. Don't try to handle arbitrary XML as if it's RSS. – Jon Skeet Jun 10 '14 at 10:22
Ok I didn't realise it was RSS only as I searched for XML parser NOT an RSS parser. But I guess that explains it. – MonkeyMagix Jun 10 '14 at 11:57
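To illustrate the hint about `Elements()` and `XName.LocalName` from the comments above, here is a rough sketch of walking every child of each job element without knowing the names in advance, and translating each element name through a mapping table. The JobMapper class, the MapJobs method, the sample XML and the mapping entries are all hypothetical; the real mapping would come from the DataTable described in the comments.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

static class JobMapper
{
    // Returns one dictionary per <job>, keyed by database field name.
    // nodeToField maps feed element names to database field names.
    public static List<Dictionary<string, string>> MapJobs(
        string xml, IDictionary<string, string> nodeToField)
    {
        var rows = new List<Dictionary<string, string>>();
        XDocument doc = XDocument.Parse(xml);
        foreach (XElement job in doc.Root.Elements("job"))
        {
            var row = new Dictionary<string, string>();
            // Elements() with no argument returns every child element, whatever its name.
            foreach (XElement child in job.Elements())
            {
                string nodeName = child.Name.LocalName;
                string fieldName;
                if (nodeToField.TryGetValue(nodeName, out fieldName))
                {
                    row[fieldName] = child.Value; // unmapped elements are skipped
                }
            }
            rows.Add(row);
        }
        return rows;
    }

    static void Main()
    {
        // Hypothetical element names and field names.
        var mapping = new Dictionary<string, string>
        {
            { "title", "JobTitle" },
            { "location", "JobLocation" }
        };
        string xml = "<jobs><job><title>Tester</title><location>Glasgow</location></job></jobs>";
        foreach (Dictionary<string, string> row in MapJobs(xml, mapping))
        {
            foreach (KeyValuePair<string, string> field in row)
            {
                Console.WriteLine(field.Key + " = " + field.Value);
            }
        }
    }
}
```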

score 0 · Answer 2 · answered Jun 09 '14 at 08:36


Silly as it sounds, try Html Agility Pack. It is designed to deal with not-so-well-formed input and you can use XPath-like expressions to traverse the tree.


Anton Gogolev


score 0 · Answer 3 · answered Jun 09 '14 at 08:36

This worked for me. I used DownloadString.

        var feedURL = "http://scotjobsnet.co.uk.ni.strategiesuk.net/testfeed.xml";
        XmlDocument xmlDoc = new XmlDocument();
        string content;

        // get from web as string
        using (var webClient = new WebClient())
        {
            // Tell them who we are for white listing
            webClient.Headers.Add("user-agent", "Mozilla/5.0 (compatible; Job Feed Importer;)");

            // fetch feed as string
            content = webClient.DownloadString(feedURL);
        }

        xmlDoc.LoadXml(content);
        XmlNodeList jobs = xmlDoc.GetElementsByTagName("job");
        // Declare the loop variable as XmlNode; with var it would be typed as object.
        foreach (XmlNode job in jobs)
        {
            // Loop through jobs
        }

score 0 · Accepted Answer · answered Jun 09 '14 at 09:03


I used the following solution, please have a look:

        XmlDocument xdoc = new XmlDocument();
        // Note: Load(url) sends no User-Agent header; the comments on Answer 1
        // report that this server answers such requests with a 403.
        xdoc.Load("http://scotjobsnet.co.uk.ni.strategiesuk.net/testfeed.xml");

        // Select every <job> element directly under the root element.
        XmlElement root = xdoc.DocumentElement;
        XmlNodeList xNodelst = root.SelectNodes("job");
        foreach (XmlNode node in xNodelst)
        {
            string location = node.SelectSingleNode("location").InnerText;
            Response.Write("<br/> location = " + location);
        }
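Since the thread notes that the server rejects requests without a User-Agent, one way to keep this XmlDocument and SelectNodes style while still sending the header is to open the feed with WebClient and hand the stream to Load. This is only a sketch: LocationReader, ReadLocations and DownloadLocations are hypothetical names, and the sample XML in Main stands in for the real feed.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;
using System.Xml;

static class LocationReader
{
    // Reads the <location> text of every <job> under the root element.
    public static List<string> ReadLocations(Stream feed)
    {
        var xdoc = new XmlDocument();
        xdoc.Load(feed);
        var locations = new List<string>();
        foreach (XmlNode node in xdoc.DocumentElement.SelectNodes("job"))
        {
            XmlNode location = node.SelectSingleNode("location");
            if (location != null) // skip jobs that have no <location> child
            {
                locations.Add(location.InnerText);
            }
        }
        return locations;
    }

    // Fetches the feed with an explicit User-Agent, then parses the response stream.
    public static List<string> DownloadLocations(string url)
    {
        using (var webClient = new WebClient())
        {
            webClient.Headers.Add("user-agent", "Mozilla/5.0 (compatible; Job Feed Importer;)");
            using (Stream feed = webClient.OpenRead(url))
            {
                return ReadLocations(feed);
            }
        }
    }

    static void Main()
    {
        // Hypothetical sample standing in for the downloaded feed.
        byte[] sample = Encoding.UTF8.GetBytes(
            "<jobs><job><location>Glasgow</location></job><job><title>No location</title></job></jobs>");
        using (var stream = new MemoryStream(sample))
        {
            foreach (string location in ReadLocations(stream))
            {
                Console.WriteLine("location = " + location);
            }
        }
    }
}
```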


Khurram Ishaque



Thanks this was exactly what I needed along with ensuring I added a user-agent (as I ban blank user-agents) to prevent script kiddy scraping. Plus I could still use my existing code node.name and node.innerText to get the values and then the correct mappings from my table to save to the DB. Thanks – MonkeyMagix Jun 09 '14 at 15:28
Hi, I don't suppose you know how I can solve Encoding issues as when I loop through the child nodes £ signs are returning as Ã‚Â£10,001 e.g CandidateSalary. I thought there might be an Encoding.UTF8 option on the .Load method but there isn't. The XML file is saved as UTF-8. The file I am piping the child node data out to is UTF-8 and I am using Encoding.UTF8 when piping it to the file (tried with and without). – MonkeyMagix Jun 10 '14 at 12:39
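The thread does not resolve this encoding question. One plausible cause, offered here only as an assumption, is that the UTF-8 bytes of the feed are being decoded with a single-byte code page somewhere between download and output; "£" decoded that way becomes "Â£", and repeating the mistake produces longer garbage of the kind shown. A sketch that avoids the string round trip is to give XmlDocument the raw bytes and let the XML parser pick the encoding from the byte order mark or the XML declaration. EncodingDemo, LoadFromBytes, MisdecodeAsLatin1 and the salary element are hypothetical names; in real code the bytes could come from WebClient.DownloadData.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

static class EncodingDemo
{
    // Loads XML from raw bytes so the parser, not the caller, chooses the encoding.
    public static XmlDocument LoadFromBytes(byte[] data)
    {
        var doc = new XmlDocument();
        using (var stream = new MemoryStream(data))
        {
            doc.Load(stream);
        }
        return doc;
    }

    // Reproduces the corruption: UTF-8 bytes read back as ISO-8859-1 text.
    public static string MisdecodeAsLatin1(string text)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("ISO-8859-1").GetString(utf8);
    }

    static void Main()
    {
        // \u00A3 is the pound sign; the element name is a hypothetical example.
        string xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?>" +
                     "<jobs><job><salary>\u00A310,001</salary></job></jobs>";
        byte[] data = Encoding.UTF8.GetBytes(xml);

        XmlDocument doc = LoadFromBytes(data);
        string salary = doc.SelectSingleNode("/jobs/job/salary").InnerText;

        Console.WriteLine("Parsed correctly: " + (salary == "\u00A310,001"));
        Console.WriteLine("Mis-decoded length: " + MisdecodeAsLatin1("\u00A3").Length);
    }
}
```

If the feed has to be fetched as a string instead, setting the WebClient.Encoding property to Encoding.UTF8 before calling DownloadString is another option worth trying.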
