Converting doc to txt and also converting the entities using C#?

Question

How do I convert a doc file containing non-ASCII (UTF-8) characters to txt and automatically convert those characters to their proper hexadecimal NCR sequences (e.g. &#xABCD;)?

Below is a sample text from a doc file:

Isto é um teste. Eu não me importo com o que você pensa.
Você acha que me conhece muito bem.

After converting this to a txt file, the output should be:

Isto &#x00E9; um teste. Eu n&#x00E3;o me importo com o que voc&#x00EA; pensa.
Voc&#x00EA; acha que me conhece muito bem.

Here is what I tried:

    Document document = new Document();

    string docPath = @"C:\Users\Tamal\Desktop";
    document.LoadFromFile(Path.Combine(docPath, "op.docx"));
    document.SaveToFile(Path.Combine(docPath, "op.txt"), FileFormat.Txt);

    string readText = File.ReadAllText(Path.Combine(docPath, "op.txt"));
    System.Diagnostics.Process.Start(Path.Combine(docPath, "op.txt"));
    Console.ReadLine();

But this writes the text file with the characters unchanged, exactly as they appear in the doc file:

Isto é um teste. Eu não me importo com o que você pensa.
Você acha que me conhece muito bem.

How and where do I add the entity hexadecimal conversion?

NOTE: I am using Spire.Doc for converting doc to txt.

score 0 · Answer 1 · answered May 06 '18 at 14:38

Run your string through System.Net.WebUtility.HtmlEncode(string)
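
A minimal sketch of how that call might be wired into the question's code, assuming the text file from the question already exists at the same path. Because .NET strings are immutable, `HtmlEncode` returns a new string rather than changing its argument, so the result has to be captured and written back. The `txtPath` and `encoded` names are illustrative and not part of the original code.

```csharp
using System.IO;
using System.Net;

class Program
{
    static void Main()
    {
        // Path taken from the question; adjust for your machine.
        string docPath = @"C:\Users\Tamal\Desktop";
        string txtPath = Path.Combine(docPath, "op.txt");

        string readText = File.ReadAllText(txtPath);

        // HtmlEncode does not modify readText; it returns the encoded copy.
        string encoded = WebUtility.HtmlEncode(readText);

        // Write the encoded copy, not the original string.
        File.WriteAllText(txtPath, encoded);
    }
}
```

Note that `HtmlEncode` also escapes markup characters such as `<`, `>` and `&`, and, as far as I know, it emits decimal references rather than hexadecimal ones, so it may not produce the exact format asked for in the question.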

Chris H

I replaced the last two lines of code by `string readText = File.ReadAllText(Path.Combine(docPath,"op.txt")); System.Net.WebUtility.HtmlEncode(readText); File.WriteAllText(Path.Combine(docPath,"op.txt"),readText); System.Diagnostics.Process.Start(Path.Combine(docPath,"op.txt"));`, but it does nothing? – Tamal Banerjee May 06 '18 at 14:43
something like this `string encodedString = System.Net.WebUtility.HtmlEncode(readText);` – Chris H May 06 '18 at 14:49
but the output shows `Isto &#233; um teste. Eu n&#227;o me importo com o que voc&#234; pensa. Voc&#234; acha que me conhece muito bem.` that's a different encoding – Tamal Banerjee May 06 '18 at 14:53
They should be equivalent. If you need the hex encoding it is more work. Take a look here. https://stackoverflow.com/questions/4663538/how-to-convert-unicode-character-to-its-escaped-ascii-equivalent-in-c-sharp – Chris H May 06 '18 at 14:58
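
For the hexadecimal form, one possible approach is to walk the string and replace every non-ASCII character yourself. The sketch below is only an illustration under a few assumptions: `NcrEncoder` and `ToHexNcr` are hypothetical names (not part of Spire.Doc or the .NET libraries), the text file is assumed to be in an encoding that `File.ReadAllText` detects correctly, and ASCII markup characters such as `&` and `<` are deliberately left untouched.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of Spire.Doc or the .NET base class library.
static class NcrEncoder
{
    // Replaces every character outside the ASCII range with a
    // hexadecimal numeric character reference such as &#x00E9;.
    public static string ToHexNcr(string input)
    {
        var sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (c < 128)
            {
                // Plain ASCII is copied through unchanged.
                sb.Append(c);
                continue;
            }

            int codePoint = c;
            if (char.IsSurrogatePair(input, i))
            {
                // Combine the two UTF-16 code units into one code point
                // and skip the low surrogate on the next iteration.
                codePoint = char.ConvertToUtf32(input, i);
                i++;
            }

            // "X4" pads to at least four uppercase hex digits.
            sb.Append("&#x").Append(codePoint.ToString("X4")).Append(';');
        }
        return sb.ToString();
    }
}

class Program
{
    static void Main()
    {
        // Prints: Isto &#x00E9; um teste.
        Console.WriteLine(NcrEncoder.ToHexNcr("Isto é um teste."));
    }
}
```

In the question's code this would be applied to `readText` before writing the file back, for example `File.WriteAllText(Path.Combine(docPath, "op.txt"), NcrEncoder.ToHexNcr(readText));`.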
