How to programmatically find the start of the text content within rich text?

Question

I'm trying to create a program which reads text as rich text and outputs it as Markdown. I've copied the following paragraph into a RichTextBox (emphasis preserved from the original):

A necessary component of **narratives and story-telling**. When an **author** of a story (be it a writer, speaker, film-maker or otherwise,) conveys a story to their audience, the **audience** is allowed to construct an internal representation of the world in which the story takes place (the “story world”). How the audience does this is dependent on which aspects of the world the author chooses to explicitly include in the narrative, such as the characters and characterisation, the settings and their descriptions, and information about the story world which the audience might not know.

And when I read the RichTextBox.Rtf property, it looks like this (emphasis added for demonstration):

{\rtf1\fbidis\ansi\ansicpg1252\deff0\deflang2057{\fonttbl{\f0\fswiss\fprq2\fcharset0 Arial;}{\f1\froman\fprq2\fcharset0 Times New Roman;}} {\colortbl ;\red0\green0\blue0;} \viewkind4\uc1\pard\ltrpar\cf1\f0\fs22 A necessary component of \b narratives and story-telling\b0 . When an \b author\b0 of a story (be it a writer, speaker, film-maker or otherwise,) conveys a story to their audience, the \b audience \b0 is allowed to construct an internal representation of the world in which the story takes place (the \ldblquote story world\rdblquote ). How the audience does this is dependent on which aspects of the world the author chooses to explicitly include in the narrative, such as the characters and characterisation, the settings and their descriptions, and information about the story world which the audience might not know.\cf0\f1\fs24\par \pard\ltrpar\sa160\sl252\slmult1\fs22\par \pard\ltrpar\cf1\f0\par }

I want to extract the text content from this RTF string. I'm not interested in the control codes before and after the text; all I want to keep is the bold, italic and other formatting within it. What I can't work out is how to determine where the text starts for any given paragraph.

As a human, I obviously know where the text starts: right after the section I've bolded. I don't know how to tell my program what to look for, though. I'm pretty sure the RTF code at the start is different for every paragraph, so I can't just tell my program to find this particular code and delete it.

Something else I thought of was searching the RTF output for the first n characters of the original paragraph, such as "A necessary component". But if any of those first words is bolded, a control word like \b will sit between them in the RTF, so the search string won't match and that approach won't work consistently either.

I'm sure I'm missing an obvious solution, but if anyone knows how I can reliably work out where my text content starts and ends, I'd be grateful.

I'm using VB.NET in Winforms, so would prefer an answer in VB.NET or pseudocode.

Why an off-topic close vote? This is a question about programming which can be answered using programming. Also, why the drive-by downvote? If there's something that can be improved about this question, feel free to input and I will attempt to improve the question. — Lou, Jan 23 '20 at 10:34
You could look at the source code of the utilities mentioned in the answer to [How do I convert an RTF string to a Markdown string...](https://stackoverflow.com/q/46119392/1115360). — Andrew Morton, Jan 23 '20 at 10:52
Thanks, that led me to [this article](https://www.codeproject.com/Articles/51879/Converting-RTF-to-HTML-in-VB-NET-the-Easy-Way#_comments) which basically gave me a function for converting RTF to HTML, which is half the battle. Now it should be easier to parse the HTML into markdown, hopefully ... — Lou, Jan 23 '20 at 11:48

score 0 · Answer 1 · answered Jan 23 '20 at 12:35

Well, it's super janky, but I've got the solution to my problems.

I found this article which has a complete function written in VB.NET to convert RTF to HTML.

Then I wrote the code below, which takes the HTML output from that function and converts it to Markdown. It only handles bold text, but so far it works perfectly on my input.

    ' Only convert when the input box actually contains text.
    If InputRTB.Text <> "" Then
        ' sRTF_To_HTML is the RTF-to-HTML function from the article mentioned above.
        Dim output As String = sRTF_To_HTML(InputRTB.Rtf)

        ' Keep only the contents of the first <span style=...> element,
        ' which is assumed to hold the paragraph text.
        Dim spanStart = output.IndexOf("<span style")
        If spanStart >= 0 Then
            output = output.Substring(spanStart)
            output = output.Substring(output.IndexOf(">") + 1)
        End If
        Dim endpos = output.IndexOf("</span>")
        If endpos >= 0 Then output = output.Substring(0, endpos)

        ' Replace each <b>...</b> pair with **...** and collect the bold phrases.
        Dim boldWords As New List(Of String)
        Do While output.Contains("<b>")
            Dim startb = output.IndexOf("<b>")
            Dim endb = output.IndexOf("</b>", startb)
            If endb < 0 Then Exit Do ' unmatched <b>: stop rather than throw

            Dim word = output.Substring(startb + 3, endb - startb - 3).Trim()
            If word <> "" Then
                ' Capitalise the first letter for the word list.
                word = Char.ToUpper(word(0)) & word.Substring(1)
                boldWords.Add(word)
            End If

            ' Swap only the first remaining opening and closing tag.
            output = Replace(output, "<b>", "**", , 1)
            output = Replace(output, "</b>", "**", , 1)
        Loop

        ' Collapse line breaks so the paragraph ends up on one line.
        output = output.Replace(vbCrLf, " ")

        OutputRTB.Text = output

        ' List the bold phrases, one per line.
        WordListRTB.Clear()
        For Each b As String In boldWords
            WordListRTB.AppendText(b & vbCrLf)
        Next

        ' Clipboard.SetText throws on an empty string, so guard against it.
        If OutputRTB.Text <> "" Then
            Clipboard.SetText(OutputRTB.Text)
            MsgBox("Copied output to clipboard")
        End If
    End If
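
The bold-handling loop above can also be sketched as a single regular-expression pass. This is only a minimal sketch under some assumptions: `BoldHtmlToMarkdown` is a hypothetical helper name (it is not part of the linked article's code), the input is assumed to be the HTML fragment left after the `<span>` stripping, and bold is assumed to appear as plain `<b>...</b>` tags with no attributes or nesting. It also moves whitespace from inside the bold run to outside the markers, because the RTF in the question has `\b audience \b0`, and Markdown renderers generally do not treat `**audience **` as bold.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module BoldToMarkdownSketch
    ' Hypothetical helper (an assumption for illustration only).
    ' Converts each <b>...</b> run to Markdown **...** and appends every
    ' non-empty bold phrase to boldPhrases.
    Function BoldHtmlToMarkdown(html As String, boldPhrases As List(Of String)) As String
        Dim options = RegexOptions.IgnoreCase Or RegexOptions.Singleline
        Return Regex.Replace(html, "<b>(.*?)</b>",
            Function(m As Match) As String
                Dim inner As String = m.Groups(1).Value
                Dim phrase As String = inner.Trim()
                ' A bold run containing only whitespace: just drop the tags.
                If phrase = "" Then Return inner
                boldPhrases.Add(phrase)
                ' Keep leading/trailing whitespace, but outside the ** markers.
                Dim lead = inner.Substring(0, inner.Length - inner.TrimStart().Length)
                Dim trail = inner.Substring(inner.TrimEnd().Length)
                Return lead & "**" & phrase & "**" & trail
            End Function, options)
    End Function

    Sub Main()
        Dim phrases As New List(Of String)
        Dim html = "When an <b>author</b> of a story speaks, the <b>audience </b>is allowed"
        ' Prints: When an **author** of a story speaks, the **audience** is allowed
        Console.WriteLine(BoldHtmlToMarkdown(html, phrases))
        ' Prints: author, audience
        Console.WriteLine(String.Join(", ", phrases))
    End Sub
End Module
```

In the handler above, the `Do ... Loop` could then be replaced by `output = BoldHtmlToMarkdown(output, boldWords)`. Note that this sketch does not capitalise the collected phrases; that step would need to be added if it is wanted.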
