0

I'm inserting some text into a SQL Server database using some C#. I'm using System.Web.HttpUtility.HtmlEncode() to attempt to replace characters that aren't compatible with the charset of my db (set to default collation = SQL_Latin1_General_CP1_CI_AS), but it is not helping me with some "weird" characters.

For example, this dash character: ― is different from the standard dash character. (-)

HtmlEncode() doesn't replace the first dash with an entity (I think it is "—") and it ends up as a "?" in my database.

What is the best way to clean / sanitize a text field before inserting into SQL Server?

I would rather not change my database configuration / collation.

Dudeman3000
  • 551
  • 8
  • 21
  • "―" is not an HTML metacharacter, so `HtmlEncode()` is not going to do anything to it. SQL and HTML metacharacters are completely different anyway; why are you putting HTML metacharacters in an SQL database? – Dour High Arch Mar 29 '13 at 01:11
  • *Use* placeholders, and N[VAR]CHAR columns for Unicode .. not sure of a standard way to replace with HTML [*character entities*](http://dev.w3.org/html5/html-author/charref), though. (Would also have to encoded all other non-ASCII characters, even without official entity names ..) –  Mar 29 '13 at 01:13
  • Dour, I would rather not put HTML into the text fields, but I will if it gets rid of these question marks that I get for unrecognized characters. – Dudeman3000 Mar 29 '13 at 01:17
  • @Dudeman3000 The fundamental issue with the `?` is that Latin != Unicode, and `―` is not in the CP. You're putting in HTML. The question is then just how to *encode* it - which could be done entirely with HTML *character entities*. (HtmlEncode doesn't do this because normally a UTF stream is served back and such is not required to avoid preventing HTML injection messes. This task is really "How to encode a UTF-8 HTML response as ASCII?") I suggest widening the column to N[VAR]CHAR and avoid the mess .. –  Mar 29 '13 at 01:18
  • You need to use `Encoding.Convert` as explained [here](http://stackoverflow.com/questions/1922199/). HTML has nothing to do with the issue or solution. – Dour High Arch Mar 29 '13 at 01:21
  • pst, yes, I think NVARCHAR fields instead of VARCHAR fields would help me out, but what I'm really looking for is a piece of code someone's written that is going to make getting rid of / replacing incompatible characters as painless as possible. I'm hoping there is a piece of code that is basically going to take that first weird dash and replace it with a normal dash character for me (along with other replacements of course) – Dudeman3000 Mar 29 '13 at 01:21
  • @DourHighArch The `?` already comes as the result of an implicit conversion - for an unsupported character. You *can't* fit all the characters into Latin1. Even if `―` is converted to `-` (I don't know or care), there will be many characters with no equivalent mapping .. –  Mar 29 '13 at 01:21
  • @Dudeman3000 Then consider my question suggestion (as a basis for searching, at least): "How to encode Unicode HTML as ASCII?" –  Mar 29 '13 at 01:23
  • 1
    @pst, and `Encoding.Convert` does not try to fit all characters into latin1. It omits them, as I thought Duderman wanted. – Dour High Arch Mar 29 '13 at 01:23
  • @DourHighArch Does he really want *lost* data? I guess we have different ideas of clean/sanitize .. (edit in response to following comment: "I guess so") –  Mar 29 '13 at 01:24
  • Encoding.Convert is what I was looking for - thanks Dour! pst, I will remove the HTML tag and drop it from the title. --- looks like I spoke too soon... yeah, I would rather not drop characters but in this particular case it might be an option. – Dudeman3000 Mar 29 '13 at 01:24
  • You can store text in any encoding in blob fields (`image` in SQL Server terms) - that, of course, means change in the table schema. – Igor Mar 29 '13 at 01:32
  • That sounds pretty good Igor... I'll update this post if I make that change - thanks! At this point I think that maybe the only better solution would be a piece of brute force "replace this character with that character" C# code. Shotgun not writing that piece of code... – Dudeman3000 Mar 29 '13 at 01:40

0 Answers0