I have a program that reads through a Microsoft Word 2010 document and puts all text read from the first column of every table into a datatable. However, the resulting text also includes special formatting characters (that are usually invisible in the original Word document).

Is there a way that I can take the string of text that I've read and strip all the formatting characters from it?

The program is pretty simple, and uses the Microsoft.Office.Interop.Word assemblies. Here is the main loop where I'm grabbing the text from the document:

        // Loop through each table in the document, 
        // grab only text from cells in the first column
        // in each table.
        foreach (Table tb in docs.Tables)
        {
            for (int row = 1; row <= tb.Rows.Count; row++)
            {
                var cell = tb.Cell(row, 1);
                var listNumber = cell.Range.ListFormat.ListString;
                var text = listNumber + " " + cell.Range.Text;

                dt.Rows.Add(text);
            }
        }

EDIT: Here is what the text ("1. Introduction") looks like in the Word document: [screenshot]

This is what it looks like before being put into my datatable: [screenshot]

And this is what it looks like when put into the datatable: [screenshot]

So, I'm trying to figure out a simple way to get rid of the control characters that seem to be appearing (\r, \a, \n, etc.).

EDIT: Here is the code I'm trying to use. I created a new method to convert the string:

    private string ConvertToText(string rtf)
    {
        using (RichTextBox rtb = new RichTextBox())
        {
            rtb.Rtf = rtf;
            return rtb.Text;
        }
    }

When I run the program, it fails with the following error: [screenshot]

The variable rtf, at this point, looks like this: [screenshot]

RESOLUTION: I trimmed the unneeded characters before writing them to the datatable.

        // Loop through each table in the document, 
        // grab only text from cells in the first column
        // in each table.
        foreach (Table tb in docs.Tables)
        {
            for (int row = 1; row <= tb.Rows.Count; row++)
            {
                var charsToTrim = new[] { '\r', '\a', ' ' };
                var cell = tb.Cell(row, 1);
                var listNumber = cell.Range.ListFormat.ListString;
                var text = listNumber + " " + cell.Range.Text;
                text = text.TrimEnd(charsToTrim);
                dt.Rows.Add(text);
            }
        }
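
For reference, the same trimming can be pulled out into a small standalone helper. Word ends the text of every table cell with a paragraph mark (`\r`) followed by an end-of-cell marker (`\a`, character code 7), which is where the trailing characters come from. This is only a sketch: the class and method names (`CellTextCleaner`, `Clean`) are made up for illustration, and unlike the loop above it skips the separator space when the cell has no list number.

```csharp
using System;

public static class CellTextCleaner
{
    // Characters Word appends to a cell's Range.Text, plus trailing spaces.
    private static readonly char[] CellEndChars = { '\r', '\a', ' ' };

    // Hypothetical helper: joins the list number and the cell text,
    // then trims the end-of-cell markers from the end of the result.
    public static string Clean(string listNumber, string cellText)
    {
        string combined = string.IsNullOrEmpty(listNumber)
            ? cellText
            : listNumber + " " + cellText;
        return combined.TrimEnd(CellEndChars);
    }

    public static void Main()
    {
        // Simulates what cell.Range.Text returns for "Introduction".
        string cleaned = Clean("1.", "Introduction\r\a");
        Console.WriteLine("[" + cleaned + "]"); // prints "[1. Introduction]"
    }
}
```

Inside the loop this would be called as `dt.Rows.Add(CellTextCleaner.Clean(listNumber, cell.Range.Text));`.
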
Kevin
  • What chars do you need stripping? – It'sNotALie. Jul 23 '13 at 15:25
  • According to [the documentation for Range.Text](http://msdn.microsoft.com/en-us/library/microsoft.office.interop.word.range.text.ASPX), the text is the *plain, unformatted text of the selection or range*, so I'm not sure what formatting you're talking about. – Matthew Watson Jul 23 '13 at 15:27
  • possible duplicate of http://stackoverflow.com/questions/188545/regular-expression-for-extracting-text-from-an-rtf-string – slfan Jul 23 '13 at 15:27
  • @MatthewWatson When I get the text from the Word document, it looks like this "1. Introduction\r\a". In the Word document, it doesn't have the "\r" or "\a" characters visible. – Kevin Jul 23 '13 at 15:44
  • @JavaRox What is the text in your string rtf? – Ehsan Jul 23 '13 at 16:01
  • @EhsanUllah See edits above. "rtf" contains "1. Introduction \r\a" at the time the error occurs. – Kevin Jul 23 '13 at 16:06
  • @JavaRox Where is the RTF Formatting though? – Ehsan Jul 23 '13 at 16:08
  • @EhsanUllah Okay, I'll try that now. – Kevin Jul 23 '13 at 16:08
  • richTextBox1.Text = test.TrimEnd("\r\a".ToCharArray()); this works – Ehsan Jul 23 '13 at 16:12
  • @EhsanUllah Heh, I just simplified it and decided to go with: var charsToTrim = new[] { '\r', '\a', ' ' }; text = text.TrimEnd(charsToTrim); It seems that only the \r\a special characters were being used throughout the Word document, so I just trimmed those, per your suggestion and didn't need to use a RichTextBox to convert. Thanks! – Kevin Jul 23 '13 at 16:36
  • @JavaRox Glad I was helpful. – Ehsan Jul 23 '13 at 16:37

4 Answers

I don't know exactly what formatting you're trying to remove, but you could try something like:

text = new string(text.Where(c => !Char.IsControl(c)).ToArray());

That should strip the non-printing characters out.
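
As a self-contained sketch of this idea (the class and method names are made up for illustration): the LINQ query has to be materialized back into a string with `new string(...ToArray())`, because calling `ToString()` directly on the result of `Where` returns the type name of the iterator rather than the filtered text. Keep in mind that this removes every control character, so embedded tabs and line breaks inside the cell text disappear as well.

```csharp
using System;
using System.Linq;

public static class ControlCharFilter
{
    // Hypothetical helper: keeps every character that is not a
    // Unicode control character ('\r', '\a', '\n', '\t', ...).
    public static string StripControlChars(string input)
    {
        return new string(input.Where(c => !char.IsControl(c)).ToArray());
    }

    public static void Main()
    {
        string cleaned = StripControlChars("1. Introduction\r\a");
        Console.WriteLine("[" + cleaned + "]"); // prints "[1. Introduction]"
    }
}
```
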

Andrew Coonce

Why don't you give this a try:

using System;
using System.Text.RegularExpressions;

public class Example
{
    static string CleanInput(string strIn)
    {
        // Replace invalid characters with empty strings. 
        try {
           return Regex.Replace(strIn, @"[^\w\.@-]", "", 
                                RegexOptions.None, TimeSpan.FromSeconds(1.5)); 
        }
        // If we timeout when replacing invalid characters,  
        // we should return Empty. 
        catch (RegexMatchTimeoutException) {
           return String.Empty;   
        }
    }
}

Here's a link for it as well.

http://msdn.microsoft.com/en-us/library/844skk0h.aspx

trueamerican420
  • Heh, this appears to work, but it also seems to remove spaces as well! If I can figure out why it's doing that, this will work. – Kevin Jul 23 '13 at 16:30
  • Try removing the \w. Not a hundred percent sure this will fix your problem, but just experiment with the characters inside the []. Good luck :) and don't forget to upvote the answer that worked for you! (doesn't have to be mine) haha – trueamerican420 Jul 23 '13 at 17:32
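
On the disappearing spaces: the pattern `[^\w\.@-]` removes every character that is not a word character, a period, an `@`, or a hyphen, and a space is none of those. A narrower pattern that targets only control characters leaves spaces alone. The sketch below assumes that stripping the Unicode "control" category (`\p{Cc}`) is what is wanted; the class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexCleaner
{
    // The whitelist pattern from the answer: also removes spaces,
    // because a space is not in the allowed set.
    public static string WhitelistClean(string input)
    {
        return Regex.Replace(input, @"[^\w\.@-]", "");
    }

    // Hypothetical alternative: removes only control characters
    // (Unicode category Cc), so spaces and punctuation survive.
    public static string RemoveControlChars(string input)
    {
        return Regex.Replace(input, @"\p{Cc}+", "");
    }

    public static void Main()
    {
        string raw = "1. Introduction\r\a";
        Console.WriteLine("[" + WhitelistClean(raw) + "]");     // prints "[1.Introduction]"
        Console.WriteLine("[" + RemoveControlChars(raw) + "]"); // prints "[1. Introduction]"
    }
}
```
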

An alternative is to add a RichTextBox to your form (you can keep it hidden if you don't want to show it) and, once you have read all your data, assign it to the RichTextBox's Rtf property. For example:

//rtfText is rich text
//rtBox is rich text box
rtBox.Rtf = rtfText;
//get simple text here.
string plainText = rtBox.Text;
Ehsan

A totally different approach would be to look at the Open XML SDK.
This example should get you started.

weismat