I want to replace accented chars (such as á, ñ, ¿, ¡, etc.) with the corresponding HTML codes (such as &aacute;, &ntilde;, &iquest;, &iexcl;, etc.).

For example, this line of text:

Imposible me ha sido rehusarme á las repetidas instancias que el Caballero Trelawney, el Doctor Livesey y otros muchos señores me

...should become:

Imposible me ha sido rehusarme &aacute; las repetidas instancias que el Caballero Trelawney, el Doctor Livesey y otros muchos se&ntilde;ores me

This should be simple. I've got this code to make the attempt:

private void buttonReplaceCharsWithCodes_Click(object sender, EventArgs e)
{
    String fallName = String.Empty;
    List<String> linesModified = new List<string>();
    StreamReader file = null;

    try // finally
    {
        try // catch
        {

            DialogResult result = openFileDialog1.ShowDialog();
            if (result == DialogResult.OK)
            {
                fallName = openFileDialog1.FileName;
            }
            file = new StreamReader(fallName);
            String line;
            while ((line = file.ReadLine()) != null)
            {
                linesModified.Add(line);
            }

            progressBar1.Maximum = linesModified.Count;
            progressBar1.Value = 0;
            labelProgFeedback.Text = "Replacing accented chars with HTML codes";

            for (int i = 0; i < linesModified.Count; i++)
            {
                linesModified[i] = linesModified[i].Replace("á", "&aacute;");
                linesModified[i] = linesModified[i].Replace("Á", "&Aacute;");
                linesModified[i] = linesModified[i].Replace("é", "&eacute;");
                linesModified[i] = linesModified[i].Replace("É", "&Eacute;");
                linesModified[i] = linesModified[i].Replace("í", "&iacute;");
                linesModified[i] = linesModified[i].Replace("Í", "&Iacute;");
                linesModified[i] = linesModified[i].Replace("ñ", "&ntilde;");
                linesModified[i] = linesModified[i].Replace("Ñ", "&Ntilde;");
                linesModified[i] = linesModified[i].Replace("ó", "&oacute;");
                linesModified[i] = linesModified[i].Replace("Ó", "&Oacute;");
                linesModified[i] = linesModified[i].Replace("ú", "&uacute;");
                linesModified[i] = linesModified[i].Replace("Ú", "&Uacute;");
                linesModified[i] = linesModified[i].Replace("ü", "&uuml;");
                linesModified[i] = linesModified[i].Replace("Ü", "&Uuml;");
                linesModified[i] = linesModified[i].Replace("¿", "&iquest;");
                linesModified[i] = linesModified[i].Replace("¡", "&iexcl;");
                progressBar1.PerformStep();
            }
            progressBar1.Value = 0;
        }
        catch (Exception ex)
        {
            MessageBox.Show(String.Format("Exception {0}", ex.Message));
        }
    }
    finally
    {
        String massagedFileName = String.Format("{0}_Massaged.txt", fallName);
        File.WriteAllLines(massagedFileName, linesModified);
        file.Close();
    }

}

Unfortunately, it doesn't work. It replaces the accented chars with the "what the heck?!?" symbol (�) instead of the HTML code desired. What is required to get this to work?

UPDATE

In answer to the comments, this is the contents of the file I load:

Imposible me ha sido rehusarme á las repetidas instancias que el Caballero Trelawney, el Doctor Livesey y otros muchos señores me han hecho para que escribiese la historia circunstanciada y completa de la Isla del Tesoro. Voy, pues, á poner manos á la obra contándolo todo, desde el alfa hasta el omega, sin dejarme cosa alguna en el tintero, exceptuando la determinación geográfica de la isla, y esto tan solamente porque tengo por seguro que en ella existe todavía un tesoro no descubierto. Tomo la pluma en el año de gracia de 17-- y retrocedo hasta la época en que mi padre tenía aún la posada del "Almirante Benbow," y hasta el día en que por primera vez llegó á alojarse en ella aquel viejo marino de tez bronceada y curtida por los elementos, con su grande y visible cicatriz.

...and this is the file it saves with the replacements:

Imposible me ha sido rehusarme � las repetidas instancias que el Caballero Trelawney, el Doctor Livesey y otros muchos se�ores me han hecho para que escribiese la historia circunstanciada y completa de la Isla del Tesoro. Voy, pues, � poner manos � la obra cont�ndolo todo, desde el alfa hasta el omega, sin dejarme cosa alguna en el tintero, exceptuando la determinaci�n geogr�fica de la isla, y esto tan solamente porque tengo por seguro que en ella existe todav�a un tesoro no descubierto. Tomo la pluma en el a�o de gracia de 17-- y retrocedo hasta la �poca en que mi padre ten�a a�n la posada del "Almirante Benbow," y hasta el d�a en que por primera vez lleg� � alojarse en ella aquel viejo marino de tez bronceada y curtida por los elementos, con su grande y visible cicatriz.

In other words, the replacements are not happening: I'm just seeing the "mystery" character instead of the HTML codes.

I see the same thing at runtime when I step through the code and examine the individual lines of "linesModified" (I see �s). Better than seeing stars, I guess.

This is the process: it's a simple util where I click the button to open the (.txt) file. After processing, it saves the new version of the file to a new file.

UPDATE 2

Since it's possible to specify UTF8 explicitly when saving, I thought specifying the encoding when reading the file might help too, but this:

while ((line = file.ReadLine(ASCIIEncoding.UTF8)) != null)

...doesn't compile, saying there is no overload of the ReadLine method that takes 1 argument.

B. Clay Shannon-B. Crow Raven
  • 1
    So what is the question? Why `linesModified` still contains unreplaced characters? Or, why the html still displays the wrong symbol **despite** the `linesModified` list having the correctly escaped characters? – sstan Jun 15 '15 at 16:42
  • 3
    This function actually works fine for me just copying and pasting your code into a solution. Can you provide more details on what you are seeing and how you are calling it? – AngularRat Jun 15 '15 at 16:47
  • @sstan: Neither one - linesModified's accented characters *are* replaced, but they are replaced with " �" – B. Clay Shannon-B. Crow Raven Jun 15 '15 at 17:04
  • 1
    Try putting a breakpoint before the save and looking at the replaced line in the debugger. I'd almost guess that whatever you are using to view the files is interpreting the html codes instead of displaying them. – AngularRat Jun 15 '15 at 17:04
  • Why would it interpret them as �s, and save it to file that way. And more importantly, how can I get it to "straighten up and fly right"? – B. Clay Shannon-B. Crow Raven Jun 15 '15 at 17:05
  • 2
    How are you viewing the modified file? I only ask because I literally just copied and pasted your code and it works fine for me...so I'm GUESSING the issue is the viewer and not the code. But that's just a guess. – AngularRat Jun 15 '15 at 17:07
  • @AngularRat: I open the .txt file in Notepad. When I open it in Notepad, the "�" character is instead a square, or "box" – B. Clay Shannon-B. Crow Raven Jun 15 '15 at 17:07
  • A shot in the dark, but what is the encoding on your C# source file? If you have an improper encoding, then it's possible that it isn't replacing the characters as expected, or at all actually. – sstan Jun 15 '15 at 17:09
  • @sstan: What do you mean by the encoding on my C# source file? I don't recall ever setting that anywhere. Where would I look to see what it was set to? – B. Clay Shannon-B. Crow Raven Jun 15 '15 at 17:18
  • I CodeProjectified this here: http://www.codeproject.com/Tips/1001103/How-to-Convert-Accented-Characters-to-HTML-Codes – B. Clay Shannon-B. Crow Raven Jun 15 '15 at 20:22

2 Answers

1

The only thing I can think of is explicitly specifying the encoding on the file write, like:

File.WriteAllLines(massagedFileName, linesModified, Encoding.UTF8);
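Worth noting: the write encoding only controls how the characters already in memory are stored, so it probably needs to be paired with the matching encoding on the read side. A minimal sketch of such a round trip, assuming the input file is ISO-8859-1/ANSI; the `RoundTripDemo` class, the `ConvertFile` method and the temp files are hypothetical, and only two of the replacements are shown:

```csharp
using System;
using System.IO;
using System.Text;

static class RoundTripDemo
{
    // Reads the file with an explicit input encoding, replaces two accented
    // characters with their HTML entities, and writes the result as UTF-8.
    public static void ConvertFile(string inputPath, string outputPath, Encoding inputEncoding)
    {
        string[] lines = File.ReadAllLines(inputPath, inputEncoding);
        for (int i = 0; i < lines.Length; i++)
        {
            // \u00E1 is á and \u00F1 is ñ; the escapes keep this sketch
            // independent of the encoding of the C# source file itself.
            lines[i] = lines[i].Replace("\u00E1", "&aacute;")
                               .Replace("\u00F1", "&ntilde;");
        }
        File.WriteAllLines(outputPath, lines, Encoding.UTF8);
    }

    static void Main()
    {
        // Assumption: the real input is an ANSI/Latin-1 text file. A temp file
        // written as ISO-8859-1 stands in for it here.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        string input = Path.GetTempFileName();
        string output = Path.GetTempFileName();
        File.WriteAllText(input, "se\u00F1ores \u00E1 las", latin1);

        ConvertFile(input, output, latin1);

        // prints "se&ntilde;ores &aacute; las"
        Console.WriteLine(File.ReadAllText(output, Encoding.UTF8));
    }
}
```

If the entities show up with the read encoding set but not without it, the problem is in how the file is decoded, not in how it is written.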
AngularRat
  • Good idea, but adding "Encoding.UTF8" to the WriteAllLines method made no difference. – B. Clay Shannon-B. Crow Raven Jun 15 '15 at 17:19
  • 1
    Well, I'm out of ideas. I'll update if I think of anything else, but like I said in earlier comments, it all works for me just copying and pasting your code. Just out of curiosity, what version of windows are you running this in? I'm testing in a win7 environment, and notepad is smart enough to handle everything correctly there at least. – AngularRat Jun 15 '15 at 17:22
  • Yeah, I'm on Windows 7, also. Visual Studio 2010. I am curious about sstan's comment, though: is it really so that one can assign an encoding to the IDE itself? If so, how, I wonder... – B. Clay Shannon-B. Crow Raven Jun 15 '15 at 17:30
  • 1
    For what it's worth, I tested in VS2013. I don't have 2010 on this PC, so can't test in that environment right now. Though that SHOULDN'T make a difference. What .Net version are you compiling to? – AngularRat Jun 15 '15 at 17:34
0

The answer by Jerome Laben here works - I just needed to change this line of code:

file = new StreamReader(fallName);

...to this:

file = new StreamReader(fallName, Encoding.Default, true);

...and now it works:

Imposible me ha sido rehusarme &aacute; las repetidas instancias que el Caballero Trelawney, el Doctor Livesey y otros muchos se&ntilde;ores me han hecho para que escribiese la historia circunstanciada y completa de la Isla del Tesoro. Voy, pues, &aacute; poner manos &aacute; la obra cont&aacute;ndolo todo, desde el alfa hasta el omega, sin dejarme cosa alguna en el tintero, exceptuando la determinaci&oacute;n geogr&aacute;fica de la isla, y esto tan solamente porque tengo por seguro que en ella existe todav&iacute;a un tesoro no descubierto. Tomo la pluma en el a&ntilde;o de gracia de 17-- y retrocedo hasta la &eacute;poca en que mi padre ten&iacute;a a&uacute;n la posada del "Almirante Benbow," y hasta el d&iacute;a en que por primera vez lleg&oacute; &aacute; alojarse en ella aquel viejo marino de tez bronceada y curtida por los elementos, con su grande y visible cicatriz.
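My understanding of why this helps: `new StreamReader(path)` decodes the file as UTF-8 unless told otherwise, and an ANSI text file stores á as the single byte 0xE1, which is not valid UTF-8 on its own. It is therefore turned into U+FFFD (�) while reading, before any `Replace` call runs. A minimal sketch of that effect, assuming the file is ISO-8859-1/Windows-1252 (the `EncodingDemo` class and `DecodeAndReplace` method are hypothetical names for illustration):

```csharp
using System;
using System.Text;

static class EncodingDemo
{
    // Decodes raw bytes with the given encoding, then applies one replacement.
    public static string DecodeAndReplace(byte[] bytes, Encoding encoding)
    {
        string text = encoding.GetString(bytes);
        // \u00E1 is á; the escape avoids depending on the source file's encoding.
        return text.Replace("\u00E1", "&aacute;");
    }

    static void Main()
    {
        // "á las" as an ANSI file stores it: á is the single byte 0xE1.
        byte[] ansiBytes = { 0xE1, 0x20, 0x6C, 0x61, 0x73 };

        // Decoded as UTF-8, 0xE1 is an invalid sequence and becomes U+FFFD,
        // so there is no á left to replace.
        string wrong = DecodeAndReplace(ansiBytes, Encoding.UTF8);
        Console.WriteLine(wrong.Contains("\uFFFD")); // prints "True"

        // Decoded as Latin-1, the byte becomes á and the replacement applies.
        string right = DecodeAndReplace(ansiBytes, Encoding.GetEncoding("iso-8859-1"));
        Console.WriteLine(right); // prints "&aacute; las"
    }
}
```

One caveat: `Encoding.Default` is the system ANSI code page on .NET Framework (which is what I'm using), but on .NET Core and later it is UTF-8, so there the code page would have to be named explicitly.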

B. Clay Shannon-B. Crow Raven