
I am working on a test project using C# in Visual Studio 2019 Community edition.

I have a book in .rtf format. The chapter numbers are in Times New Roman 16 font.

Each chapter's verses are numbered in Arial 12 font.

I want to programmatically remove the verse numbers (the Arial 12 numbers) from every chapter while leaving the chapter numbers (Times New Roman 16) intact. For example, in "Chapter 1 1 blah blah 2 blah 3 blah", the 12 pt numbers 1, 2, and 3 should be removed and the 16 pt chapter number should stay.

I am creating a text-to-speech app that will read the .rtf book aloud. I don't want the verse numbers to be read, just the chapter number followed by the text.

Can anyone suggest how to iterate through the document and remove the verse numbers by their font, using Word Interop, Regex, or some other method?

Here is a sample of what I have tried so far, without success.

public string LocateFont(string myBook)
{
    if (myBook.ToString() == "xxxxxx")
    {
        if (rtbox1.Text != "")
        {
            Microsoft.Office.Interop.Word.Application wordApp = new Microsoft.Office.Interop.Word.Application();
            Document myDoc = wordApp.Documents.Open(rtbox1.Text);
            Microsoft.Office.Interop.Word.Range range = myDoc.Range(0, myDoc.Content.End);
                
            Regex reNum = new Regex(@"^\d+$");
            bool isNumeric = reNum.Match(rtbox1.Text).Success;
            if (isNumeric.Equals(true) & range.Find.Font.Name == "Arial" & range.Font.Size.Equals("12"))
            {
                range.Font.Equals("");
            }

        }
    }
    return rtbox1.Text.ToString();
}

OK, after reworking the code, I was finally able to get Visual Studio to recognize macropod's code suggestion, as shown below.

private void tsbtnDelFont_Click(object sender, EventArgs e)
{
    Microsoft.Office.Interop.Word.Application word = new Microsoft.Office.Interop.Word.Application();
    Microsoft.Office.Interop.Word.Document doc = new Document();
    doc.Application.Documents.Add(rtbox1.Text);
    word.ScreenUpdating = false;
    Microsoft.Office.Interop.Word.Range range = word.ActiveDocument.Content;
        
    var tempVar = range.Find;
    tempVar.ClearFormatting();
    tempVar.Font.Size = 12;
    tempVar.Font.Name = "Arial";
    tempVar.Text = "<[0-9]@>";
    tempVar.Replacement.Text = "";
    tempVar.MatchWildcards = true;
    tempVar.Wrap = WdFindWrap.wdFindContinue;
    tempVar.Execute(Microsoft.Office.Interop.Word.WdReplace.wdReplaceAll);
    object filename = @"Path to my.rtf";
    doc.SaveAs2(ref filename);
    word.ScreenUpdating = true;
}

The code compiles, I populate the RichTextBox with my .rtf document, but when I run the code that is supposed to remove the Arial 12 point fonts, I receive a COM error that the RichTextBox String is longer than 255 characters.

Can anyone suggest how to get past this impasse?


After researching this issue further, I discovered that there is an inherent 255-character limit, which cannot be easily overcome. I found an interesting article that states:

"Even find strings constructed using wildcards are limited to 255 characters."

My document contains over 40,000 words, which greatly exceeds 255 characters, and Word's own Find cannot overcome this.

The article Find & Replace (w\ Long Strings) offers a VBA solution for overcoming this 255-character limitation. However, I tried unsuccessfully to incorporate macropod's code within the article's solution. If macropod, or someone else, after reading the article, can marry his code with the solution offered in the article, that would be great.

Until then, I ran a Find and Replace using the special character code ^# (any digit). This removed the Arial 12-point verse numbers. However, it also removed the chapter numbers, which I will need to restore.


Here is the VBA macro I created inside the rich text document, based on your suggestion. However, when I run this code, it does nothing at all. It doesn't remove any of the Arial 12-point verse numbers.

Sub RemoveNumericVerses()
'
' RemoveNumericVerses Macro
'
'
Application.ScreenUpdating = False
With ActiveDocument.Range.Find
    .ClearFormatting
    .Font.Size = 12
    .Font.Name = "Arial"
    .Text = "<[0-9]@>"
    .Replacement.Text = ""
    .MatchWildcards = True
    .Wrap = wdFindContinue
    .Execute Replace:=wdReplaceAll
End With
Application.ScreenUpdating = True

End Sub

I'm still open to suggestions, but creating and executing the aforementioned VBA macro does not work.


Joel Coehoorn, here is the excerpt that you asked for. If the formatting carries over, you should see the chapter numbers in Times New Roman 16 font, and the verses in Arial 12 font.

31 There they buried Abraham and his wife Sarah. There they buried Isaac and his wife Rebekʹah, and there I buried Leʹah. 32 The field and the cave that is in it were purchased from the sons of Heth.” 33 Thus Jacob finished giving these instructions to his sons. Then he drew his feet up onto the bed and breathed his last and was gathered to his people. Chapter 50 1 Joseph then threw himself on his father and wept over him and kissed him. 2 After that Joseph commanded his servants, the physicians, to embalm his father. So the physicians embalmed Israel, 3 and they took the full 40 days for him, for this is the full period for the embalming, and the Egyptians continued to shed tears for him 70 days.

CodeMann
  • This requires nothing more complex than a *wildcard* Find/Replace with the font name & size specified as Find attributes. What have you tried? Post your code. – macropod Jan 01 '23 at 00:48
  • @macropod, I would appreciate seeing an example in code, for what you suggest. I have included a screen shot of what I've tried so far, but even my method may not be the best way to do what I'm trying to accomplish. – CodeMann Jan 01 '23 at 15:17
  • I don't code in C#, only in VBA. I could post the pretty trivial VBA code for you to adapt. Way faster than looping through all paragraphs then using Regex & font testing on each match... – macropod Jan 02 '23 at 01:21
  • **Do not post images of code!** – Joel Coehoorn Jan 02 '23 at 01:56
  • @macropod. As I stated in my earlier comments, I would like to see your vba code, which you intimate can address my needs. Please do so, as I can convert your code to C#. However, please provide a complete example, demonstrating your solution. Bear in mind, this needs to be able to address either a .rtf document, or Microsoft Word, and remove numeric fonts. – CodeMann Jan 02 '23 at 20:44
  • @macropod. I have managed to get all of your code suggestion recognized by Visual Studio, with the exception of the line 'tempVar.Execute Replace:wdReplaceAll;' VS doesn't recognize the next to last line, as shown in your example. Above, I have shown my efforts to implement your code. Perhaps I'm missing something?? – CodeMann Jan 04 '23 at 23:29
  • Can you post an excerpt of the RTF markup? Ideally it would include the last few words of one chapter and the start of the next into at least the third verse – Joel Coehoorn Jan 24 '23 at 22:19
  • Joel Coehoorn, I have supplied the requested excerpt from my rtf document above. I'm not sure if the formatting carried over when I posted it or not, as I chose a block quote to supply it in. – CodeMann Jan 26 '23 at 19:43

1 Answer

For example, with VBA:

Sub Demo()
Application.ScreenUpdating = False
With ActiveDocument.Range.Find
  .ClearFormatting
  .Font.Size = 12
  .Font.Name = "Arial"
  .Text = "<[0-9]@>"
  .Replacement.Text = ""
  .MatchWildcards = True
  .Wrap = wdFindContinue
  .Execute Replace:=wdReplaceAll
End With
Application.ScreenUpdating = True
End Sub
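
For C# callers, the same Find/Replace might look roughly like the sketch below. It is not part of the original VBA answer: the class name, the method name, and the two path parameters are hypothetical. It opens the .rtf file by path with Documents.Open, on the assumption that the "String is longer than 255 characters" error reported in the question comes from passing the RichTextBox text to Documents.Add, whose first argument is a template file name rather than document content.

```csharp
using System.Runtime.InteropServices;
using Word = Microsoft.Office.Interop.Word;

static class VerseNumberRemover
{
    // inputPath and outputPath are hypothetical full paths to the .rtf files.
    public static void RemoveVerseNumbers(string inputPath, string outputPath)
    {
        var app = new Word.Application();
        app.Visible = false;
        Word.Document doc = null;
        try
        {
            // Open the document from disk instead of handing its text to Documents.Add.
            doc = app.Documents.Open(inputPath);

            Word.Find find = doc.Content.Find;
            find.ClearFormatting();
            find.Font.Name = "Arial";        // match only Arial ...
            find.Font.Size = 12;             // ... at 12 pt
            find.Format = true;              // make the font criteria part of the search
            find.Text = "<[0-9]@>";          // Word wildcard: a whole run of digits
            find.MatchWildcards = true;
            find.Forward = true;
            find.Wrap = Word.WdFindWrap.wdFindContinue;
            find.Replacement.ClearFormatting();
            find.Replacement.Text = "";      // delete each match

            // Equivalent of VBA's ".Execute Replace:=wdReplaceAll".
            find.Execute(Replace: Word.WdReplace.wdReplaceAll);

            // Save a copy as RTF so the source file is left untouched.
            doc.SaveAs2(outputPath, Word.WdSaveFormat.wdFormatRTF);
        }
        finally
        {
            doc?.Close(SaveChanges: false);
            app.Quit();
            Marshal.ReleaseComObject(app);
        }
    }
}
```

The file path, not the RichTextBox contents, is what gets passed in, for example `VerseNumberRemover.RemoveVerseNumbers(sourcePath, cleanedPath)` with two placeholder path variables. Word must be installed for the Interop calls to work.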
macropod
  • I have managed to get Visual Studio to recognize all of your code suggestion, with the exception of the line reading 'tempVar.Execute Replace:wdReplaceAll;' I have posted above, my attempt at using your code, but need assistance in getting the problematic line working. Any suggestions? – CodeMann Jan 04 '23 at 23:37
  • See: https://stackoverflow.com/questions/19252252/c-sharp-word-interop-find-and-replace-everything. It seems you need to qualify wdReplaceAll with something like Word.WdReplace.wdReplaceAll – macropod Jan 04 '23 at 23:58
  • Using the suggestions found online, I revised my code. However, now it removes the contents of the entire document, instead of just the Numbered Verses. I posted my revised code above. – CodeMann Jan 05 '23 at 21:52
  • Although I don't code in c#, I can see that your isNumeric test should not be in the code - the Find expression finds only numeric strings. All you need is the: findObject.Replacement.Text = ""; – macropod Jan 05 '23 at 22:01
  • I removed my isNumeric test, but the code still removes the entire contents of the document. I've been trying various Regex combinations and come close with [A-Za-z(\d\W)], which correctly leaves the chapter number. However, it still leaves the verse numbers, which is what I want to remove. – CodeMann Jan 06 '23 at 20:31
  • Using the expression [^0-9]+ comes even closer. It successfully removes All the verse numbers. However, it also removes the chapter numbers as well. I need a hybrid, which leaves the chapter numbers but removes the verse numbers. – CodeMann Jan 06 '23 at 22:20
  • RegEx cannot do the job on its own, because RegEx ignores formatting. A *wildcard* Find/Replace (which uses a Word-specific version of RegEx) can be written to take account of formatting and a whole host of other conditions. If you run the VBA code I posted on your document, you will see that only the 12pt Arial numbers get deleted. – macropod Jan 06 '23 at 23:58
  • Please examine the 'SearchReplace()' code I posted above, and tell me if you believe it implements your VBA example. I ask, because when I run that code, it deletes the ENTIRE contents of the document, not just the 12pt Arial numbers. If the above code does follow your VBA example, then please suggest why it deletes everything. – CodeMann Jan 08 '23 at 14:55
  • Clearly, your c# code does not correctly implement the conversion of the VBA code I posted, since the VBA code only deletes 12pt Arial numerals. Since I don't code in c#, I am unable to say what particular changes you need to make. – macropod Jan 09 '23 at 20:13
  • I reworked the code (see example above) and visual studio recognizes all of macropod's code. However, it throws a COMException of 'String is longer than 255 characters' on RichTextBox.Text. Can anyone suggest how to get past this issue? Perhaps bypassing the RichTextBox and applying the code to the .rtf document directly? – CodeMann Jan 13 '23 at 19:25
  • Macropod, please see the results of my research above and the article that I reference. I appreciate your code suggestion, but the article makes clear why I've run into difficulty trying to use it. If, after reading the article, you can incorporate your code within the article's solution, I would greatly appreciate it. – CodeMann Jan 21 '23 at 21:57
  • The 255-character limitation has nothing to do with your problem. That concerns only the length of the Find/Replace strings and your Find text is waaay less than 255 characters in length! – macropod Jan 22 '23 at 20:26
  • macropod, please see my VBA macro code shown above. When I use Alt + F8 to create and run the code, it does Not work. Is this not the code suggestion you offered? Can you tell me why it does not remove anything? – CodeMann Jan 24 '23 at 21:55
  • The VBA code I posted *does* work on numbers formatted as per your specifications. If you're not getting the expected results when you run it, your specifications are wrong. – macropod Jan 24 '23 at 22:12
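
If the macro runs but deletes nothing, one way to test the "specifications are wrong" suggestion is to print the font Word actually reports for the numbers in the document. The following is a minimal diagnostic sketch under that assumption. The class name, the method name, the path parameter, and the maxHits cutoff are hypothetical and do not come from the thread.

```csharp
using System;
using System.Linq;
using System.Runtime.InteropServices;
using Word = Microsoft.Office.Interop.Word;

static class FontProbe
{
    // Prints the font name and size of the first few all-digit words in the document.
    public static void PrintNumberFonts(string path, int maxHits = 20)
    {
        var app = new Word.Application();
        Word.Document doc = null;
        try
        {
            doc = app.Documents.Open(path, ReadOnly: true);
            int hits = 0;
            foreach (Word.Range word in doc.Words)
            {
                string text = (word.Text ?? "").Trim();
                if (text.Length == 0 || !text.All(char.IsDigit))
                    continue; // skip anything that is not purely numeric

                // A name or size that differs from "Arial" / 12 would explain
                // why the formatted Find matches nothing.
                Console.WriteLine($"{text}: {word.Font.Name} {word.Font.Size}pt");

                if (++hits >= maxHits)
                    break; // COM enumeration is slow, so stop early
            }
        }
        finally
        {
            doc?.Close(SaveChanges: false);
            app.Quit();
            Marshal.ReleaseComObject(app);
        }
    }
}
```

If the output shows a different font name or size for the verse numbers (for instance, superscripted text or a substituted font), the Find criteria in the macro would need to be adjusted to match.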