
Other ASCII codes are doing the same thing.

Just to give you some background, these codes are part of the HTML that I'm reading from WordPress blog posts. I'm porting them over to BlogEngine.NET using a little C# WinForm app I wrote. Do I need to do some kind of conversion as I port them over to BlogEngine.NET (as XML files)?

It'd sure be nice if they just displayed properly without any intervention on my part.

Here's a code fragment from one of the WordPress source pages:

<link rel="alternate" type="application/rss+xml" title="INRIX® Traffic &raquo; Taking the &#8220;E&#8221; out of your &#8220;ETA&#8221; Comments Feed" href="http://www.inrixtraffic.com/blog/2012/taking-the-e-out-of-your-eta/feed/" />

Here's the corresponding chunk of XML that's in the XML file I output during the conversion:

<title>Taking the &amp;#8220;E&amp;#8221; out of your &amp;#8220;ETA&amp;#8221;</title>

UPDATE.

Tried this, but still no dice.

writer.WriteElementString("title", string.Format("<![CDATA[{0}]]>", post.Title));

...outputs this:

<title>&lt;![CDATA[Taking the &amp;#8220;E&amp;#8221; out of your &amp;#8220;ETA&amp;#8221;]]&gt;</title>
birdus
  • Is there any reason for not using the equivalent HTML escape characters? – Alex Lynham Apr 10 '13 at 20:45
  • Can you give us some sample HTML? Are you sure the `&` doesn't get encoded into `&amp;`, so that the code shows up in the browser as the literal text `&#8220;`? – Steve Apr 10 '13 at 20:45
  • They're probably right, with the encoding occurring at that stage. I've had a similar problem with other CMSs reading from databases that had different character sets than the input method. Definitely post some HTML. – David R. Apr 10 '13 at 20:50
  • I think you might be onto something, @Steve. How do I deal with that? – birdus Apr 10 '13 at 20:50
  • You're trying to get it to encode it twice? – David R. Apr 10 '13 at 20:51
  • No. Not trying. Just reading it in, then spitting it out. – birdus Apr 10 '13 at 20:51
  • @AlexLynham The way I figure it, I'd need to construct a complete table of ASCII values in my application, then map the numerical ASCII values to the HTML escape values. That would be a royal pain. There has to be a much simpler way. – birdus Apr 10 '13 at 21:04

3 Answers


Since the data you are getting from WordPress is already HTML-encoded, you can decode it to a regular string and then let the XmlWriter encode it properly for XML.

string input = "Taking the &#8220;E&#8221; out of your &#8220;ETA&#8221;";
string decoded = System.Net.WebUtility.HtmlDecode(input);
// decoded == "Taking the “E” out of your “ETA”" (curly quotes, U+201C and U+201D)

This may not be very efficient, but since this sounds like a one-time conversion I don't think it will be an issue.
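To make the full round trip concrete, here is a minimal sketch of decoding first and then handing the plain string to the writer. The class and method names (`TitleExport`, `WriteTitle`) are made up for illustration, and the `XmlWriterSettings` are an assumption about how the export might be configured, not the asker's actual code.

```csharp
using System;
using System.IO;
using System.Net;
using System.Xml;

public static class TitleExport
{
    // Hypothetical helper: decode the HTML entities, then let XmlWriter
    // apply whatever escaping XML itself requires.
    public static string WriteTitle(string rawTitle)
    {
        // "&#8220;" becomes the single character U+201C, and so on.
        string decoded = WebUtility.HtmlDecode(rawTitle);

        // Assumption: no XML declaration, so the output is just the element.
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        var buffer = new StringWriter();
        using (XmlWriter writer = XmlWriter.Create(buffer, settings))
        {
            // The decoded text contains no '&', so nothing gets double-escaped.
            writer.WriteElementString("title", decoded);
        }
        return buffer.ToString();
    }
}
```

Calling `TitleExport.WriteTitle("Taking the &#8220;E&#8221; out of your &#8220;ETA&#8221;")` should return `<title>Taking the “E” out of your “ETA”</title>`, with the curly quotes stored as real characters rather than as character references.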

A similar question was asked here: How can I decode HTML characters in C#?

leemicw
  • WONDERFUL! Thank you! You are right on all counts. One time, so efficiency doesn't matter. Works perfectly! – birdus Apr 11 '13 at 21:14

As I pointed out in my comment above: your problem is that your &#8220; gets encoded into &amp;#8220;. When you output this in the browser, it displays as the literal text &#8220; instead of the quotation mark.

I don't know how your porting works, but to fix this issue, you need to make sure that the & in the ASCII codes doesn't get encoded to &amp;.
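One way to see the double encoding, and one possible way around it, is sketched below. `WriteString` (like `WriteElementString`) escapes every `&`, while `WriteRaw` between `WriteStartElement` and `WriteEndElement` copies the text through untouched. The names `EscapingDemo` and `WriteTitle` are hypothetical. Note that `WriteRaw` does no validation, so an HTML-only named entity such as `&raquo;` would produce XML that is not well-formed.

```csharp
using System;
using System.IO;
using System.Xml;

public static class EscapingDemo
{
    // Hypothetical helper: writes <title> with or without XmlWriter's escaping.
    public static string WriteTitle(string text, bool raw)
    {
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        var buffer = new StringWriter();
        using (XmlWriter writer = XmlWriter.Create(buffer, settings))
        {
            writer.WriteStartElement("title");
            if (raw)
            {
                // Copied verbatim: no escaping and no well-formedness check.
                writer.WriteRaw(text);
            }
            else
            {
                // Escapes '&' to "&amp;", which double-encodes "&#8220;".
                writer.WriteString(text);
            }
            writer.WriteEndElement();
        }
        return buffer.ToString();
    }
}
```

With the input `Taking the &#8220;E&#8221;`, the escaped path should give `<title>Taking the &amp;#8220;E&amp;#8221;</title>` and the raw path `<title>Taking the &#8220;E&#8221;</title>`. Decoding the text before writing it is usually the safer choice, since the raw path trusts the input completely.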

Steve
  • When I read in the HTML from the WordPress page, something like &#8217; actually gets stored into a string. Then, I'm writing that out using XmlWriter and WriteElementString(). Do you know how I can disable any "favors" that it thinks it's doing me? – birdus Apr 10 '13 at 20:59
  • You might want to take a look at this SO Question: http://stackoverflow.com/questions/2176843/how-to-prevent-the-conversion-of-to-amp-using-xmltextwriter – Steve Apr 10 '13 at 21:04
  • That's funny. I was just looking at that. Still doesn't seem to give me a solution, though. I've tried tweaking a couple settings (Encoding and CheckCharacters) of the XmlWriter, but it keeps outputting the same thing. And WriteRaw won't let me specify the XML element name. – birdus Apr 10 '13 at 21:25
  • I didn't. I'll give that a shot. – birdus Apr 10 '13 at 22:05
  • Tried CDATA. No dice. Maybe I did it wrong. I posted it in the original question up top. – birdus Apr 10 '13 at 22:14

Any chance CDATA tags solve the issue? Just make sure the text is correct in the source XML file. You don't need the ampersand magic (in the source) if you use CDATA tags.

<some_tag><![CDATA[Taking the “ out of your ...]]></some_tag>
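The attempt shown in the question's update fails because `WriteElementString` treats the `<![CDATA[` wrapper as ordinary text and escapes it. A minimal sketch of emitting a real CDATA section uses `XmlWriter.WriteCData` instead. The names `CDataDemo` and `WriteTitleAsCData` are made up for illustration. The text is decoded first, on the assumption that the real characters are wanted: character references are not expanded inside CDATA, so an undecoded `&#8220;` would be kept as those seven literal characters.

```csharp
using System;
using System.IO;
using System.Net;
using System.Xml;

public static class CDataDemo
{
    // Hypothetical helper: writes the title inside a genuine CDATA section.
    public static string WriteTitleAsCData(string rawTitle)
    {
        // Decode so the CDATA section holds real characters, not references.
        string decoded = WebUtility.HtmlDecode(rawTitle);

        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        var buffer = new StringWriter();
        using (XmlWriter writer = XmlWriter.Create(buffer, settings))
        {
            writer.WriteStartElement("title");
            // WriteCData emits the <![CDATA[ ... ]]> markers itself.
            writer.WriteCData(decoded);
            writer.WriteEndElement();
        }
        return buffer.ToString();
    }
}
```

For the input `Taking the &#8220;E&#8221;`, this should produce `<title><![CDATA[Taking the “E”]]></title>`.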
Erik Nijland