I'm trying to write a program that reads rich text and outputs it as Markdown. I've copied the following paragraph into a RichTextBox (emphasis preserved from the original):
A necessary component of **narratives and story-telling**. When an **author** of a story (be it a writer, speaker, film-maker or otherwise,) conveys a story to their audience, the **audience** is allowed to construct an internal representation of the world in which the story takes place (the “story world”). How the audience does this is dependent on which aspects of the world the author chooses to explicitly include in the narrative, such as the characters and characterisation, the settings and their descriptions, and information about the story world which the audience might not know.
And when I read the RichTextBox.Rtf property, it looks like this (emphasis added for demonstration):
{\rtf1\fbidis\ansi\ansicpg1252\deff0\deflang2057{\fonttbl{\f0\fswiss\fprq2\fcharset0 Arial;}{\f1\froman\fprq2\fcharset0 Times New Roman;}} {\colortbl ;\red0\green0\blue0;} \viewkind4\uc1\pard\ltrpar\cf1\f0\fs22 A necessary component of \b narratives and story-telling\b0 . When an \b author\b0 of a story (be it a writer, speaker, film-maker or otherwise,) conveys a story to their audience, the \b audience \b0 is allowed to construct an internal representation of the world in which the story takes place (the \ldblquote story world\rdblquote ). How the audience does this is dependent on which aspects of the world the author chooses to explicitly include in the narrative, such as the characters and characterisation, the settings and their descriptions, and information about the story world which the audience might not know.\cf0\f1\fs24\par \pard\ltrpar\sa160\sl252\slmult1\fs22\par \pard\ltrpar\cf1\f0\par }
I want to extract the text content from this RTF string. I'm not interested in the control codes before and after the text; all I care about is bold, italic and other inline formatting. What I can't work out is how to determine where the text starts for any given paragraph.
As a human, I obviously know where the text starts: right after the section I've bolded, i.e. the leading run of control codes that ends in \fs22. I don't know how to tell my program what to look for, though. I'm pretty sure the RTF code at the start is different for every paragraph, so I can't just tell my program to find this particular code and delete it.
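The closest I've come is a rough scan that tracks brace depth, skips nested groups and control words, and stops at the first plain character. This is only a sketch and I haven't checked it against the RTF specification. It assumes that the font and colour tables are always nested groups, that body text sits at brace depth 1, and that the input is well formed. FindTextStart is just a name I made up.

```vb
Imports System

Module RtfScanSketch
    ' Returns the index of the first plain-text character in the RTF string,
    ' or -1 if none is found. ASSUMPTION: header tables (fonts, colours) are
    ' nested groups and body text sits at brace depth 1.
    Function FindTextStart(rtf As String) As Integer
        Dim depth As Integer = 0
        Dim i As Integer = 0
        While i < rtf.Length
            Dim c As Char = rtf(i)
            If c = "{"c Then
                depth += 1
                i += 1
            ElseIf c = "}"c Then
                depth -= 1
                i += 1
            ElseIf c = "\"c Then
                i += 1
                If i < rtf.Length AndAlso Char.IsLetter(rtf(i)) Then
                    ' Control word: letters, then an optional signed number.
                    While i < rtf.Length AndAlso Char.IsLetter(rtf(i))
                        i += 1
                    End While
                    If i < rtf.Length AndAlso rtf(i) = "-"c Then i += 1
                    While i < rtf.Length AndAlso Char.IsDigit(rtf(i))
                        i += 1
                    End While
                    ' A single space after a control word is a delimiter, not text.
                    If i < rtf.Length AndAlso rtf(i) = " "c Then i += 1
                Else
                    ' Control symbol: skip the one character after the backslash.
                    i += 1
                End If
            ElseIf depth = 1 AndAlso Not Char.IsWhiteSpace(c) Then
                ' First non-whitespace character outside any nested group.
                Return i
            Else
                i += 1
            End If
        End While
        Return -1
    End Function

    Sub Main()
        ' Shortened, hypothetical RTF modelled on the output above.
        Dim rtf As String = "{\rtf1\ansi{\fonttbl{\f0\fswiss Arial;}}{\colortbl ;\red0\green0\blue0;}\pard\cf1\f0\fs22 A necessary component\par }"
        Dim start As Integer = FindTextStart(rtf)
        Console.WriteLine(rtf.Substring(start)) ' prints: A necessary component\par }
    End Sub
End Module
```

Even if this is on the right track, it has obvious gaps: if the paragraph opens with a formatting control word such as \b, that word is skipped along with the header, so the formatting state would have to be tracked separately, and escaped characters are not handled at all.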
Something else I thought of was searching the RTF output for the first n characters of the original paragraph, for example "A necessary component". But if any of those first words is bold, the text won't appear verbatim in the RTF output, so that approach won't work consistently either.
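To illustrate why the search fails, here is a minimal sketch using two shortened, made-up RTF strings modelled on the output above. FindPrefix is a hypothetical helper that just wraps String.IndexOf.

```vb
Imports System

Module SearchDemo
    ' Returns the index of the first occurrence of the plain-text prefix
    ' in the RTF string, or -1 if it does not appear verbatim.
    Function FindPrefix(rtf As String, prefix As String) As Integer
        Return rtf.IndexOf(prefix, StringComparison.Ordinal)
    End Function

    Sub Main()
        ' Hypothetical samples: the same sentence, with and without bold
        ' inside the first few words.
        Dim plainStart As String = "{\rtf1\ansi\pard\f0\fs22 A necessary component of \b narratives\b0 .\par }"
        Dim boldStart As String = "{\rtf1\ansi\pard\f0\fs22 A \b necessary\b0  component of narratives.\par }"

        ' The prefix is found when no control word interrupts it.
        Console.WriteLine(FindPrefix(plainStart, "A necessary component"))

        ' The \b and \b0 control words split the prefix, so this prints -1.
        Console.WriteLine(FindPrefix(boldStart, "A necessary component"))
    End Sub
End Module
```

The second call returns -1 because the control words sit in the middle of the phrase being searched for, which is exactly the case I can't rule out.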
I'm sure I'm missing an obvious solution, but if anyone knows how I can reliably work out where my text content starts and ends, I'd be grateful.
I'm using VB.NET in WinForms, so I'd prefer an answer in VB.NET or pseudocode.