Rich Text to Plain Text via C#?

Question

I have a program that reads through a Microsoft Word 2010 document and puts all text read from the first column of every table into a datatable. However, the resulting text also includes special formatting characters (that are usually invisible in the original Word document).

Is there a way that I can take the string of text that I've read and strip all the formatting characters from it?

The program is pretty simple, and uses the Microsoft.Office.Interop.Word assemblies. Here is the main loop where I'm grabbing the text from the document:

        // Loop through each table in the document, 
        // grab only text from cells in the first column
        // in each table.
        foreach (Table tb in docs.Tables)
        {
            for (int row = 1; row <= tb.Rows.Count; row++)
            {
                var cell = tb.Cell(row, 1);
                var listNumber = cell.Range.ListFormat.ListString;
                var text = listNumber + " " + cell.Range.Text;

                dt.Rows.Add(text);
            }
        }

EDIT: Here is what the text ("1. Introduction") looks like in the Word document: [screenshot]

This is what it looks like before being put into my datatable: [screenshot]

And this is what it looks like when put into the datatable: [screenshot]

So, I'm trying to figure out a simple way to get rid of the control characters that seem to be appearing (\r, \a, \n, etc).

EDIT: Here is the code I'm trying to use. I created a new method to convert the string:

    private string ConvertToText(string rtf)
    {
        using (RichTextBox rtb = new RichTextBox())
        {
            rtb.Rtf = rtf;
            return rtb.Text;
        }
    }

When I run the program, it bombs with an "invalid format" error when the string is assigned to rtb.Rtf: [screenshot]

The variable rtf, at this point, contains "1. Introduction \r\a": [screenshot]

RESOLUTION: I trimmed the unneeded characters before writing them to the datatable.

        // Loop through each table in the document, 
        // grab only text from cells in the first column
        // in each table.
        foreach (Table tb in docs.Tables)
        {
            for (int row = 1; row <= tb.Rows.Count; row++)
            {
                var charsToTrim = new[] { '\r', '\a', ' ' };
                var cell = tb.Cell(row, 1);
                var listNumber = cell.Range.ListFormat.ListString;
                var text = listNumber + " " + cell.Range.Text;
                text = text.TrimEnd(charsToTrim);
                dt.Rows.Add(text);
            }
        }
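The trimming step can be pulled out into a small helper so it can be exercised without Word installed. This is only a minimal sketch, assuming every cell's text ends with the `\r\a` pair seen in this question; `CellText` and `Clean` are hypothetical names, not part of the interop API. Unlike the loop above, it also drops the leading space that appears when a cell has no list number.

```csharp
using System;

public static class CellText
{
    // Characters seen at the end of Range.Text for a table cell in this question:
    // '\r' (paragraph mark), '\a' (end-of-cell marker), plus trailing spaces.
    private static readonly char[] CharsToTrim = { '\r', '\a', ' ' };

    // Combines the list number (may be empty) with the raw cell text and
    // trims the trailing markers. Interior characters are left untouched.
    public static string Clean(string listNumber, string rawCellText)
    {
        var text = (listNumber + " " + rawCellText).TrimEnd(CharsToTrim);

        // Assumption: a leading space is unwanted when there is no list number.
        return text.TrimStart(' ');
    }
}
```

Inside the loop this would be called as `dt.Rows.Add(CellText.Clean(listNumber, cell.Range.Text));`.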

According to [the documentation for Range.Text](http://msdn.microsoft.com/en-us/library/microsoft.office.interop.word.range.text.ASPX), the text is the *plain, unformatted text of the selection or range*, so I'm not sure what formatting you're talking about. — Matthew Watson, Jul 23 '13 at 15:27
possible duplicate of http://stackoverflow.com/questions/188545/regular-expression-for-extracting-text-from-an-rtf-string — slfan, Jul 23 '13 at 15:27
@MatthewWatson When I get the text from the Word document, it looks like this "1. Introduction\r\a". In the Word document, it doesn't have the "\r" or "\a" characters visible. — Kevin, Jul 23 '13 at 15:44
@EhsanUllah See edits above. "rtf" contains "1. Introduction \r\a" at the time the error occurs. — Kevin, Jul 23 '13 at 16:06
richTextBox1.Text = test.TrimEnd("\r\a".ToCharArray()); this works – Ehsan, Jul 23 '13 at 16:12
@EhsanUllah Heh, I just simplified it and decided to go with: var charsToTrim = new[] { '\r', '\a', ' ' }; text = text.TrimEnd(charsToTrim); It seems that only the \r\a special characters were being used throughout the Word document, so I just trimmed those, per your suggestion and didn't need to use a RichTextBox to convert. Thanks! — Kevin, Jul 23 '13 at 16:36

score 2 · Answer 1 · answered Jul 23 '13 at 15:22

I don't know exactly what formatting you're trying to remove, but you could try something like:

// requires: using System.Linq;
text = new string(text.Where(c => !Char.IsControl(c)).ToArray());

That should strip the non-printing characters out.
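One thing to watch with this approach: `Char.IsControl` is true for `\r`, `\n` and `\t` as well as `\a`, so control characters inside the text are removed too, not just the trailing ones. A self-contained sketch of the idea (`ControlChars` and `Strip` are names made up for illustration):

```csharp
using System;
using System.Linq;

public static class ControlChars
{
    // Removes every character for which char.IsControl is true
    // (this includes '\r', '\n', '\t' and '\a'), wherever it occurs.
    public static string Strip(string input)
    {
        return new string(input.Where(c => !char.IsControl(c)).ToArray());
    }
}
```

For a cell holding two paragraphs, `ControlChars.Strip("Line one\rLine two\r\a")` joins them with nothing in between, so trimming only the end of the string may be the better fit when cells can contain more than one paragraph.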

Andrew Coonce


score 1 · Answer 2 · answered Jul 23 '13 at 15:21

Why don't you give this a try:

using System;
using System.Text.RegularExpressions;

public class Example
{
    static string CleanInput(string strIn)
    {
        // Replace invalid characters with empty strings. 
        try {
           return Regex.Replace(strIn, @"[^\w\.@-]", "", 
                                RegexOptions.None, TimeSpan.FromSeconds(1.5)); 
        }
        // If we timeout when replacing invalid characters,  
        // we should return Empty. 
        catch (RegexMatchTimeoutException) {
           return String.Empty;   
        }
    }
}

Here's a link for it as well.

http://msdn.microsoft.com/en-us/library/844skk0h.aspx

trueamerican420


Heh, this appears to work, but it also seems to remove spaces as well! If I can figure out why it's doing that, this will work. – Kevin Jul 23 '13 at 16:30
Try removing the \w. Not a hundred percent sure this will fix your problem but just experiment with the characters inside the []. Good luck :) and don't forget to upvote the answer that worked for you! (doesn't have to be mine) haha – trueamerican420 Jul 23 '13 at 17:32
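On why the spaces disappear: the character class `[^\w\.@-]` matches everything that is not a word character, a period, an `@` or a hyphen, and a space is none of those, so it is removed along with the control characters. A sketch of a narrower pattern that targets only Unicode control characters (category `Cc`) follows; `RegexClean` and `RemoveControlChars` are hypothetical names used for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexClean
{
    // \p{Cc} matches Unicode control characters (category Cc), e.g. \r, \n, \a.
    // Spaces, letters, digits and punctuation are not in that category.
    private static readonly Regex ControlRun = new Regex(@"\p{Cc}+");

    public static string RemoveControlChars(string input)
    {
        return ControlRun.Replace(input, string.Empty);
    }
}
```

Spaces survive this version, including any trailing space, so a final `TrimEnd()` may still be wanted.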

score 1 · Accepted Answer · answered Jul 23 '13 at 15:22

An alternative is to add a RichTextBox to your form (you can keep it hidden if you don't want to show it) and, once you have read all your data, assign it to the RichTextBox's Rtf property. Like:

//rtfText is rich text
//rtBox is rich text box
rtBox.Rtf = rtfText;
//get simple text here.
string plainText = rtBox.Text;

Ehsan


This would be great, but I'm getting an "invalid format" error when trying to put the string into rtBox.Rtf. I'm researching now why this is happening. – Kevin Jul 23 '13 at 15:51
can you give your exact exception? – Ehsan Jul 23 '13 at 15:53
Sure thing! I'll add it to the main question above. – Kevin Jul 23 '13 at 15:54
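A likely reason for the "invalid format" error is that `Range.Text` returns plain text rather than RTF markup, so the string handed to `Rtf` is not an RTF document at all. A small guard can check for the RTF header before attempting the conversion. This is a sketch only; `RtfCheck` and `LooksLikeRtf` are hypothetical names, and the check looks at nothing beyond the leading `{\rtf` signature.

```csharp
using System;

public static class RtfCheck
{
    // RTF documents begin with the group "{\rtf" followed by a version number.
    // This only inspects the start of the string; it does not validate the rest.
    public static bool LooksLikeRtf(string s)
    {
        return s != null
            && s.TrimStart().StartsWith(@"{\rtf", StringComparison.Ordinal);
    }
}
```

In a conversion method this could be used as `if (!RtfCheck.LooksLikeRtf(rtf)) return rtf;` placed before the assignment to the `Rtf` property, so plain strings such as "1. Introduction \r\a" pass through unchanged.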

score 0 · Answer 4 · answered Jul 23 '13 at 15:22

A totally different approach would be to look at the Open XML SDK.
This example should get you started.
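As a rough idea of what that could look like, here is a sketch that reads the first cell of every table row with the Open XML SDK. It assumes the document is saved as .docx and that the `DocumentFormat.OpenXml` NuGet package is referenced; `DocxTables` and `FirstColumnText` are hypothetical names. It has not been run against the asker's document.

```csharp
using System.Collections.Generic;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class DocxTables
{
    // Yields the text of the first cell in every row of every table.
    public static IEnumerable<string> FirstColumnText(string path)
    {
        // false = open read-only
        using (var doc = WordprocessingDocument.Open(path, false))
        {
            var body = doc.MainDocumentPart?.Document?.Body;
            if (body == null) yield break;

            // Descendants also finds tables nested inside other tables.
            foreach (var table in body.Descendants<Table>())
            {
                foreach (var row in table.Elements<TableRow>())
                {
                    var firstCell = row.Elements<TableCell>().FirstOrDefault();
                    if (firstCell != null)
                        yield return firstCell.InnerText;
                }
            }
        }
    }
}
```

`InnerText` concatenates the text runs in the cell, so there are no `\r\a` markers to trim. On the other hand, automatic list numbers (what `ListFormat.ListString` returns through interop) are computed by Word rather than stored as text, so they would not appear in this output.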

weismat

