29

My earlier post (linked below) asked what curly quotation marks are and why my app wouldn't work with them. My question now is: how can I replace them when my program comes across them? How can I do this in C#? Are they special characters?

curly-quotation-marks-vs-square-quotation-marks-what-gives

Thanks

Community
  • 1
  • 1

12 Answers

55

A more extensive listing of problematic Word characters:

if (buffer.IndexOf('\u2013') > -1) buffer = buffer.Replace('\u2013', '-');
if (buffer.IndexOf('\u2014') > -1) buffer = buffer.Replace('\u2014', '-');
if (buffer.IndexOf('\u2015') > -1) buffer = buffer.Replace('\u2015', '-');
if (buffer.IndexOf('\u2017') > -1) buffer = buffer.Replace('\u2017', '_');
if (buffer.IndexOf('\u2018') > -1) buffer = buffer.Replace('\u2018', '\'');
if (buffer.IndexOf('\u2019') > -1) buffer = buffer.Replace('\u2019', '\'');
if (buffer.IndexOf('\u201a') > -1) buffer = buffer.Replace('\u201a', ',');
if (buffer.IndexOf('\u201b') > -1) buffer = buffer.Replace('\u201b', '\'');
if (buffer.IndexOf('\u201c') > -1) buffer = buffer.Replace('\u201c', '\"');
if (buffer.IndexOf('\u201d') > -1) buffer = buffer.Replace('\u201d', '\"');
if (buffer.IndexOf('\u201e') > -1) buffer = buffer.Replace('\u201e', '\"');
if (buffer.IndexOf('\u2026') > -1) buffer = buffer.Replace("\u2026", "...");
if (buffer.IndexOf('\u2032') > -1) buffer = buffer.Replace('\u2032', '\'');
if (buffer.IndexOf('\u2033') > -1) buffer = buffer.Replace('\u2033', '\"');
Nick van Esch
  • 1,017
  • 9
  • 8
  • 4
    I'm curious, has anyone done performance testing that shows .IndexOf() is cheaper than running .Replace() on a string that doesn't contain the character? – Ted A. May 15 '14 at 19:38
  • 3
    The cheapest operation would be to iterate the string a single time, versus iterating possibly up to 2 * number of characters addressed. Eg: `foreach(char c in buffer) { /* if char in list to be replaced, replace */ }`. – Dan Jan 19 '15 at 18:08
  • I'm with Ted A. here... buffer.Replace has to be doing essentially the equivalent of buffer.IndexOf internally anyway; I can't imagine any circumstances under which the if-test and IndexOf call improves things. BUT, this looks like a great list of replacements, so thank you! – Joe Strout Apr 27 '18 at 14:55
  • 1
    Very useful list, although u201f (double high-reversed-9 quotation mark) is missing. – Moo May 23 '19 at 11:30
  • Here is a link to a list of general punctuation Unicode Characters you might consider as well [link](https://www.fileformat.info/info/unicode/block/general_punctuation/images.htm) – Paul Nakitare Oct 06 '22 at 12:14
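The single-pass idea from Dan's comment can be sketched roughly as follows. This is only an illustration, not a measured improvement: the class name `WordCharCleaner` and the `Clean` method are made up for this example, and the table simply mirrors the list above plus the U+201F character Moo mentions.

```csharp
using System.Collections.Generic;
using System.Text;

public static class WordCharCleaner
{
    // Hypothetical lookup table; extend it with whatever characters you need.
    private static readonly Dictionary<char, string> Map = new Dictionary<char, string>
    {
        { '\u2013', "-" },   // en dash
        { '\u2014', "-" },   // em dash
        { '\u2015', "-" },   // horizontal bar
        { '\u2017', "_" },   // double low line
        { '\u2018', "'" },   // left single quotation mark
        { '\u2019', "'" },   // right single quotation mark
        { '\u201a', "," },   // single low-9 quotation mark
        { '\u201b', "'" },   // single high-reversed-9 quotation mark
        { '\u201c', "\"" },  // left double quotation mark
        { '\u201d', "\"" },  // right double quotation mark
        { '\u201e', "\"" },  // double low-9 quotation mark
        { '\u201f', "\"" },  // double high-reversed-9 quotation mark
        { '\u2026', "..." }, // horizontal ellipsis
        { '\u2032', "'" },   // prime
        { '\u2033', "\"" },  // double prime
    };

    // Walks the input once and appends either the mapped text or the original char.
    public static string Clean(string buffer)
    {
        if (string.IsNullOrEmpty(buffer))
        {
            return buffer;
        }

        var sb = new StringBuilder(buffer.Length);
        foreach (char c in buffer)
        {
            if (Map.TryGetValue(c, out string replacement))
            {
                sb.Append(replacement);
            }
            else
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

Calling `WordCharCleaner.Clean(buffer)` returns a new string; the input itself is not modified. A `StringBuilder` is used instead of a `char[]` so that the ellipsis can expand to three characters.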
26

When I encountered this problem I wrote an extension method to the String class in C#.

public static class StringExtensions
{
    public static string StripIncompatableQuotes(this string inputStr)
    {
        if (string.IsNullOrWhiteSpace(inputStr))
        {
            return inputStr;
        }
        
        return inputStr.Replace('\u2018', '\'').Replace('\u2019', '\'').Replace('\u201c', '\"').Replace('\u201d', '\"');
    }
}

This simply replaces the silly 'smart quotes' with normal quotes.

[EDIT] Fixed to also support replacement of 'double smart quotes'.

Marcel Gruber
  • 6,668
  • 6
  • 34
  • 60
Matthew Ruston
  • 4,282
  • 7
  • 38
  • 47
13

To extend on Nick van Esch's popular answer, here is the code with the names of the characters in the comments.

if (buffer.IndexOf('\u2013') > -1) buffer = buffer.Replace('\u2013', '-'); // en dash
if (buffer.IndexOf('\u2014') > -1) buffer = buffer.Replace('\u2014', '-'); // em dash
if (buffer.IndexOf('\u2015') > -1) buffer = buffer.Replace('\u2015', '-'); // horizontal bar
if (buffer.IndexOf('\u2017') > -1) buffer = buffer.Replace('\u2017', '_'); // double low line
if (buffer.IndexOf('\u2018') > -1) buffer = buffer.Replace('\u2018', '\''); // left single quotation mark
if (buffer.IndexOf('\u2019') > -1) buffer = buffer.Replace('\u2019', '\''); // right single quotation mark
if (buffer.IndexOf('\u201a') > -1) buffer = buffer.Replace('\u201a', ','); // single low-9 quotation mark
if (buffer.IndexOf('\u201b') > -1) buffer = buffer.Replace('\u201b', '\''); // single high-reversed-9 quotation mark
if (buffer.IndexOf('\u201c') > -1) buffer = buffer.Replace('\u201c', '\"'); // left double quotation mark
if (buffer.IndexOf('\u201d') > -1) buffer = buffer.Replace('\u201d', '\"'); // right double quotation mark
if (buffer.IndexOf('\u201e') > -1) buffer = buffer.Replace('\u201e', '\"'); // double low-9 quotation mark
if (buffer.IndexOf('\u2026') > -1) buffer = buffer.Replace("\u2026", "..."); // horizontal ellipsis
if (buffer.IndexOf('\u2032') > -1) buffer = buffer.Replace('\u2032', '\''); // prime
if (buffer.IndexOf('\u2033') > -1) buffer = buffer.Replace('\u2033', '\"'); // double prime
  • 4
    Hi Barbara. Useful addition to the answer, but this would be better suited as a suggested edit to the existing answer instead of a new one. – user247702 May 15 '15 at 15:08
  • @Barbara, hi. Isn't there any method that can replace all such characters without specifying every character manually? What if, in future, other characters turn up besides the ones specified in the code above? – Anish Mittal Jul 28 '17 at 06:51
  • It is the current requirement in our case. Any special character from MS Word file can come and it should be converted to Straight character and shown properly. – Anish Mittal Jul 28 '17 at 06:52
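One hedge against characters the list does not cover yet is to detect them rather than guess at a replacement. The sketch below is an assumption-laden illustration, not a general solution: `UnmappedCharFinder` and `FindUnhandled` are hypothetical names, and "unhandled" simply means "non-ASCII and not in the set you pass in".

```csharp
using System.Collections.Generic;
using System.Linq;

public static class UnmappedCharFinder
{
    // Returns the distinct non-ASCII characters in text, formatted as "U+XXXX",
    // that are not in the caller-supplied set of already-handled characters.
    public static List<string> FindUnhandled(string text, ISet<char> handled)
    {
        return text
            .Where(c => c > '\u007f' && !handled.Contains(c))
            .Distinct()
            .Select(c => "U+" + ((int)c).ToString("X4"))
            .ToList();
    }
}
```

Logging the result for each incoming Word document would at least show which new characters need a mapping, instead of letting them through silently.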
6

Note that what you have is inherently a corrupt CSV file. Indiscriminately replacing all typographer's quotes with straight quotes won't necessarily fix your file. For all you know, some of the typographer's quotes were supposed to be there, as part of a field's value. Replacing them with straight quotes might not leave you with a valid CSV file, either.

I don't think there is an algorithmic way to fix a file that is corrupt in the way you describe. Your time might be better spent investigating how you come to have such invalid files in the first place, and then putting a stop to it. Is someone using Word to edit your data files, for instance?

Rob Kennedy
  • 161,384
  • 21
  • 275
  • 467
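A small illustration of the point above, using a made-up record (the `CsvQuoteDemo` class and its `Sample` line are hypothetical): when a quoted field legitimately contains curly quotes, straightening them leaves unescaped straight quotes inside the field.

```csharp
public static class CsvQuoteDemo
{
    // Hypothetical sample record: the second field is quoted and contains curly quotes.
    // As text it reads: 1,"He said “hi”"
    public const string Sample = "1,\"He said \u201chi\u201d\"";

    // Blindly straightens curly double quotes.
    public static string Straighten(string line)
    {
        return line.Replace('\u201c', '"').Replace('\u201d', '"');
    }
}
```

`CsvQuoteDemo.Straighten(CsvQuoteDemo.Sample)` yields `1,"He said "hi""`. A strict CSV reader expects embedded quotes to be doubled (`"He said ""hi"""`), so the straightened line is no longer a well-formed record.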
3

The VB equivalent of what @Matthew wrote:

Imports System.Runtime.CompilerServices

Public Module StringExtensions

    <Extension()>
    Public Function StripIncompatableQuotes(BadString As String) As String
        If Not String.IsNullOrEmpty(BadString) Then
            Return BadString.Replace(ChrW(&H2018), "'").Replace(ChrW(&H2019), "'").Replace(ChrW(&H201C), """").Replace(ChrW(&H201D), """")
        Else
            Return BadString
        End If
    End Function
End Module
cjbarth
  • 4,189
  • 6
  • 43
  • 62
3

According to the Character Map application that comes with Windows, the Unicode values for the curly quotes are 0x201c and 0x201d. Replace those values with the straight quote 0x0022, and you should be good to go.

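// 'text' is the string being cleaned; Replace returns a new string, so assign the result.
text = text.Replace('\u201c', '"');
text = text.Replace('\u201d', '"');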
Mark Ransom
  • 299,747
  • 42
  • 398
  • 622
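If Character Map isn't handy, the values can also be read straight out of the string. A minimal sketch, with `CodePointDump` and `Describe` as hypothetical names; it walks UTF-16 code units, so a character outside the Basic Multilingual Plane would show up as two surrogate values.

```csharp
using System.Collections.Generic;

public static class CodePointDump
{
    // Lists each UTF-16 code unit of text as "U+XXXX".
    public static List<string> Describe(string text)
    {
        var result = new List<string>();
        foreach (char c in text)
        {
            result.Add("U+" + ((int)c).ToString("X4"));
        }
        return result;
    }
}
```

Running it over a suspicious field shows that the curly quotes are ordinary Unicode characters (for example U+201C) rather than anything special to C#.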
3

I have a whole great big... program... that does precisely this. You can rip out the script and use it at your leisure. It does all sorts of replacements, and is located at http://bitbucket.org/nesteruk/typografix

Dmitri Nesteruk
  • 23,067
  • 22
  • 97
  • 166
2

Using Nick and Barbara's answers, here is example code with performance stats for 1,000,000 loops on my machine:

input = "shmB6BhLe0gdGU8OxYykZ21vuxLjBo5I1ZTJjxWfyRTTlqQlgz0yUtPu8iNCCcsx78EPsObiPkCpRT8nqRtvM3Bku1f9nStmigaw";
input.Replace('\u2013', '-'); // en dash
input.Replace('\u2014', '-'); // em dash
input.Replace('\u2015', '-'); // horizontal bar
input.Replace('\u2017', '_'); // double low line
input.Replace('\u2018', '\''); // left single quotation mark
input.Replace('\u2019', '\''); // right single quotation mark
input.Replace('\u201a', ','); // single low-9 quotation mark
input.Replace('\u201b', '\''); // single high-reversed-9 quotation mark
input.Replace('\u201c', '\"'); // left double quotation mark
input.Replace('\u201d', '\"'); // right double quotation mark
input.Replace('\u201e', '\"'); // double low-9 quotation mark
input.Replace("\u2026", "..."); // horizontal ellipsis
input.Replace('\u2032', '\''); // prime
input.Replace('\u2033', '\"'); // double prime

Time: 958.1011 milliseconds

input = "shmB6BhLe0gdGU8OxYykZ21vuxLjBo5I1ZTJjxWfyRTTlqQlgz0yUtPu8iNCCcsx78EPsObiPkCpRT8nqRtvM3Bku1f9nStmigaw";
var inputArray = input.ToCharArray();
for (int i = 0; i < inputArray.Length; i++)
{
    switch (inputArray[i])
    {
        case '\u2013': // en dash
            inputArray[i] = '-';
            break;
        case '\u2014': // em dash
            inputArray[i] = '-';
            break;
        case '\u2015': // horizontal bar
            inputArray[i] = '-';
            break;
        case '\u2017': // double low line
            inputArray[i] = '_';
            break;
        case '\u2018': // left single quotation mark
            inputArray[i] = '\'';
            break;
        case '\u2019': // right single quotation mark
            inputArray[i] = '\'';
            break;
        case '\u201a': // single low-9 quotation mark
            inputArray[i] = ',';
            break;
        case '\u201b': // single high-reversed-9 quotation mark
            inputArray[i] = '\'';
            break;
        case '\u201c': // left double quotation mark
            inputArray[i] = '\"';
            break;
        case '\u201d': // right double quotation mark
            inputArray[i] = '\"';
            break;
        case '\u201e': // double low-9 quotation mark
            inputArray[i] = '\"';
            break;
        case '\u2026': // horizontal ellipsis
            inputArray[i] = '.'; // a char slot holds one '.', not "..." as in the Replace version
            break;
        case '\u2032': // prime
            inputArray[i] = '\'';
            break;
        case '\u2033': // double prime
            inputArray[i] = '\"';
            break;
    }
}
input = new string(inputArray);

Time: 362.0858 milliseconds

  • 1
    Interesting, but that's a very short string in the scheme of things. It would be much more interesting to try a random 10-char, 100-char, 1000-char and 10,000-char string. That would give some idea of the order of the improvement. I know this is old, though ;-) – Ben McIntyre Jul 30 '21 at 01:32
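Ben McIntyre's suggestion could be tried with a harness along these lines. It is a rough sketch under several assumptions: the names (`ReplaceBenchmark`, `MakeInput`, `ChainedReplace`, `SinglePass`, `Time`) are invented here, only the four basic quote characters are handled to keep it short, and roughly one character in ten of the generated input is a curly quote. No timings are claimed; they will vary by machine and runtime.

```csharp
using System;
using System.Diagnostics;
using System.Text;

public static class ReplaceBenchmark
{
    // Builds a pseudo-random lowercase string in which about 1 char in 10 is U+201C.
    public static string MakeInput(int length, int seed)
    {
        var rng = new Random(seed);
        var sb = new StringBuilder(length);
        for (int i = 0; i < length; i++)
        {
            bool curly = rng.Next(10) == 0;
            sb.Append(curly ? '\u201c' : (char)('a' + rng.Next(26)));
        }
        return sb.ToString();
    }

    // One Replace call per character, as in the first timing above.
    public static string ChainedReplace(string s)
    {
        return s.Replace('\u2018', '\'').Replace('\u2019', '\'')
                .Replace('\u201c', '"').Replace('\u201d', '"');
    }

    // One pass over a char array, as in the second timing above.
    public static string SinglePass(string s)
    {
        char[] chars = s.ToCharArray();
        for (int i = 0; i < chars.Length; i++)
        {
            switch (chars[i])
            {
                case '\u2018':
                case '\u2019':
                    chars[i] = '\'';
                    break;
                case '\u201c':
                case '\u201d':
                    chars[i] = '"';
                    break;
            }
        }
        return new string(chars);
    }

    // Runs f over input the given number of times and returns the elapsed time.
    public static TimeSpan Time(Func<string, string> f, string input, int iterations)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            f(input);
        }
        sw.Stop();
        return sw.Elapsed;
    }
}
```

To compare, call `ReplaceBenchmark.Time(ReplaceBenchmark.ChainedReplace, input, n)` and `ReplaceBenchmark.Time(ReplaceBenchmark.SinglePass, input, n)` for inputs of length 10, 100, 1000 and 10,000, ideally after a warm-up run so JIT compilation is not included.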
1

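Try this for smart single quotes if the above don't work. These sequences are the UTF-8 bytes of the smart quotes (\342\200\230 in octal), which show up when UTF-8 text has been decoded one byte per character as Latin-1; C# has no octal escapes, so they are written as \u escapes:

text = text.Replace("\u00e2\u0080\u0098", "'");
text = text.Replace("\u00e2\u0080\u0099", "'");

Try this as well for smart double quotes:

text = text.Replace("\u00e2\u0080\u009c", "\"");
text = text.Replace("\u00e2\u0080\u009d", "\"");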
takrl
  • 6,356
  • 3
  • 60
  • 69
1

I also have a program which does this, the source is in this file of CP-1252 Fixer. It additionally defines some mappings for converting characters within RTF strings whilst preserving all formatting, which may be useful to some.

It is also a complete mapping of all "smart quote" characters to their low-ascii counterparts, entity codes and character references.

pospi
  • 3,540
  • 3
  • 27
  • 26
1

Just chiming in: I did this with Regex.Replace, handling a few characters at a time, grouped by what I'm replacing them with:

        public static string ReplaceWordChars(this string text)
        {
            var s = text;
            // smart single quotes and apostrophe,  single low-9 quotation mark, single high-reversed-9 quotation mark, prime
            s = Regex.Replace(s, "[\u2018\u2019\u201A\u201B\u2032]", "'");
            // smart double quotes, double prime
            s = Regex.Replace(s, "[\u201C\u201D\u201E\u2033]", "\"");
            // ellipsis
            s = Regex.Replace(s, "\u2026", "...");
            // en and em dashes
            s = Regex.Replace(s, "[\u2013\u2014]", "-");
            // horizontal bar
            s = Regex.Replace(s, "\u2015", "-");
            // double low line
            s = Regex.Replace(s, "\u2017", "-");
            // circumflex
            s = Regex.Replace(s, "\u02C6", "^");
            // open angle bracket
            s = Regex.Replace(s, "\u2039", "<");
            // close angle bracket
            s = Regex.Replace(s, "\u203A", ">");
            // small tilde and non-breaking space
            s = Regex.Replace(s, "[\u02DC\u00A0]", " ");
            // half
            s = Regex.Replace(s, "[\u00BD]", "1/2");
            // quarter
            s = Regex.Replace(s, "[\u00BC]", "1/4");
            // dot
            s = Regex.Replace(s, "[\u2022]", "*");
            // degrees 
            s = Regex.Replace(s, "[\u00B0]", " degrees");

            return s;
        }

Also a few more replacements in there.

Ed Cayce
  • 23
  • 5
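A variation on the Regex approach above is a single pattern with a lookup table, so the text is scanned once. This is only a sketch: `WordCharRegex` and its table are hypothetical and cover just a handful of the characters listed in the answer.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordCharRegex
{
    // Hypothetical subset of the replacements above, keyed by the matched text.
    private static readonly Dictionary<string, string> Map = new Dictionary<string, string>
    {
        { "\u2018", "'" },   // left single quotation mark
        { "\u2019", "'" },   // right single quotation mark
        { "\u201C", "\"" },  // left double quotation mark
        { "\u201D", "\"" },  // right double quotation mark
        { "\u2026", "..." }, // horizontal ellipsis
        { "\u2013", "-" },   // en dash
        { "\u2014", "-" },   // em dash
        { "\u00BD", "1/2" }, // vulgar fraction one half
    };

    // One character class containing every key of Map.
    private static readonly Regex Pattern =
        new Regex("[\u2018\u2019\u201C\u201D\u2026\u2013\u2014\u00BD]");

    // Replaces each match with its mapped text via a MatchEvaluator lambda.
    public static string ReplaceWordChars(string text)
    {
        return Pattern.Replace(text, m => Map[m.Value]);
    }
}
```

The character class and the dictionary keys have to be kept in sync by hand; a match with no dictionary entry would throw a `KeyNotFoundException`.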
0

This worked for me; you can try the code below:

string replacedstring = ("your string with smart quotes").Replace('\u201d', '\'');

Thanks!

Asif Ghanchi
  • 236
  • 3
  • 11