How to remove all namespaces from broken XML with C#?

Question

Here is how to remove all namespaces from XML, but it does not work for me because I sometimes get a broken XML feed, e.g.:

<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="wordpress.com" -->
<rss version="2.0"
  xmlns:atom="http://www.w3.org/2005/Atom"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
<channel>
  <title>sabri ?lker - WordPress.com Search</title>
  <link>http://tr.search.wordpress.com/?q=sabri+%C3%BClker&#038;page=2&#038;t=comment&#038;s=date</link>
  <description>sabri ?lker - WordPress.com Search</description>
  <pubDate>Fri, 04 Jan 2013 08:58:41 +0000</pubDate>
  <language>tr</language>
  <image><url>http://s.wordpress.com/i/buttonw-com.png</url><width>224</width><height>58</height><title>WordPress.com</title><link>http://wordpress.com/</link></image>
  <generator>http://search.wordpress.com/</generator>
  <atom:link rel="self" type="application/rss+xml" href="http://tr.search.wordpress.com/?q=sabri+%C3%BClker&#038;page=2&#038;t=comment&#038;s=date&amp;f=feed" />
  <atom:link rel="search" type="application/opensearchdescription+xml" href="http://en.search.wordpress.com/opensearch.xml" title="WordPress.com" />
  <opensearch:totalResults>10</opensearch:totalResults><opensearch:startIndex>11</opensearch:startIndex><opensearch:itemsPerPage>10</opensearch:itemsPerPage><opensearch:Query role="request" searchTerms="sabri ?lker startPage=\"2" /></channel>
</rss>

The exception I get is "Name cannot begin with the '2' character, hexadecimal value 0x32. Line 17, position 227." What should I do to solve this problem?

Jan Thomä · Answer 1 · 2013-01-04T09:43:22.957


I'd say the reason is the ill-formed searchTerms attribute:

searchTerms="sabri ?lker startPage=\"2"

It's quoted the wrong way: it should use &quot; instead of \". You could simply replace every \" with &quot;:

string input = ..; // your xml
string processedInput = input.Replace("\\\"", "&quot;");

// then feed this into your xml parser.

This should solve your issue, but it is of course not a general way of sanitizing malformed XML input. You may want to have a look at http://tidyfornet.sourceforge.net/, which can sanitize HTML, XHTML and XML.
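
To tie the replacement above back to the original goal, here is a minimal sketch that first repairs the \" sequences and then strips the namespaces with LINQ to XML. The names FeedCleaner, FixEscapedQuotes and StripNamespaces are made up for this illustration, and the sample input is a shortened stand-in for the feed in the question. It only handles this one kind of breakage.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper class, named for this example only.
public static class FeedCleaner
{
    // Replaces the backslash-escaped quote (\") with the XML entity &quot;.
    public static string FixEscapedQuotes(string xml)
    {
        return xml.Replace("\\\"", "&quot;");
    }

    // Rebuilds the element tree using local names only and drops xmlns declarations.
    // Caveat: two attributes with the same local name in different namespaces
    // would collide here and make the XElement constructor throw.
    public static XElement StripNamespaces(XElement element)
    {
        return new XElement(
            element.Name.LocalName,
            element.Attributes()
                   .Where(a => !a.IsNamespaceDeclaration)
                   .Select(a => new XAttribute(a.Name.LocalName, a.Value)),
            element.Nodes()
                   .Select(n => n is XElement child ? (XNode)StripNamespaces(child) : n));
    }

    public static void Main()
    {
        // Shortened sample with the same malformed attribute as in the question.
        string input =
            "<rss xmlns:opensearch=\"http://a9.com/-/spec/opensearch/1.1/\">" +
            "<opensearch:Query role=\"request\" searchTerms=\"sabri startPage=\\\"2\" />" +
            "</rss>";

        // Repair first, then parse, then strip the namespaces.
        XElement root = XDocument.Parse(FixEscapedQuotes(input)).Root;
        Console.WriteLine(StripNamespaces(root));
    }
}
```

The repair has to happen on the raw string before parsing, because the parser throws as soon as it reaches the malformed attribute.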

edited Jan 04 '13 at 09:43

answered Jan 04 '13 at 09:37

Jan Thomä


Thanks for your reply, but your solution doesn't cover the real case. With http://mehmetyldz.wordpress.com/fenerbahce-tarihi/ I then get the exception "'89f' is an unexpected token. The expected token is '='. Line 78, position 1." – RockOnGom Jan 04 '13 at 10:00
Your link is broken, could you modify your comment and fix the link? Also have you tried applying Tidy before feeding the RSS feed into your xml parser? – Jan Thomä Jan 04 '13 at 11:29
@John It is not my link; it is just content in the RSS feed (e.g. http://en.search.wordpress.com/?q=elma&f=feed). What is Tidy? – RockOnGom Jan 04 '13 at 11:49
Tidy is a library which can fix broken XML. I have provided a link to it in my answer. You can use this library to fix the XML before you give it to your xml parser. – Jan Thomä Jan 04 '13 at 13:50
Ah, I see, it's the 89f in the GUID tag. I assume that WordPress produces correct RSS feeds, so this looks like a chunking issue. How exactly do you read this RSS feed in your program? Could you add this to the question, please? – Jan Thomä Jan 04 '13 at 13:55
