Python finding bolded text in RTF

Question

I'm dealing with a gigantic rich text file where every entry starts with a bold title. It'd be really helpful to import the rich text file into Python and have it split up lines wherever it sees bold text. However, I can't find a way to import non-plaintext files, so I have resorted to looking for other ways to find where the bold text starts.

Is there a way to get Python to read where bold text is?

score 0 · Answer 1 · edited May 23 '17 at 12:16

No, not easily. Certainly not within the scope of a StackOverflow answer.

The problem is that RTF is a proprietary format with its own markup "syntax" for describing formatting.

There are libraries that make attempts to read it, which are described here: Is there a Python module for converting RTF to plain text?

However, even if one of those reads the text for you, it is unlikely to tell you anything about the formatting. After all, how would it tell you?

Your best bet may be finding an RTF to HTML converter (at least one is referred to in the question that I pointed to), then using BeautifulSoup to find the bolded HTML elements.
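As a rough sketch of the second half of that suggestion, the snippet below pulls bold text out of HTML that some RTF-to-HTML converter has already produced. It uses the standard library's html.parser in place of BeautifulSoup so that it runs without extra installs. The sample HTML and the class name BoldCollector are made up for illustration, and it assumes the converter emits <b> or <strong> tags rather than styled <span> elements.

```python
from html.parser import HTMLParser


class BoldCollector(HTMLParser):
    """Collects the text found inside <b> and <strong> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # how many bold tags we are currently inside
        self.parts = []         # text fragments of the current bold run
        self.bold_texts = []    # finished bold runs, in document order

    def handle_starttag(self, tag, attrs):
        if tag in ("b", "strong"):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in ("b", "strong") and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.bold_texts.append("".join(self.parts))
                self.parts = []

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


# Hypothetical converter output; real converters may structure this differently.
sample_html = (
    "<p><b>First title</b> body one</p>"
    "<p><strong>Second title</strong> body two</p>"
)

collector = BoldCollector()
collector.feed(sample_html)
collector.close()
print(collector.bold_texts)
```

With BeautifulSoup installed, something like soup.find_all(["b", "strong"]) should give the same elements with less code.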

score 0 · Answer 2 · answered Jun 07 '15 at 13:34

According to Wikipedia...

{\rtf1\ansi{\fonttbl\f0\fswiss Helvetica;}\f0\pard
This is some {\b bold} text.\par
}

If you want to split into new lines, I think you could do .replace('{\\b ', '\n') and be most of the way there. Switch to a regex replacement if you also want to drop the matching closing }.
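A minimal sketch of the regex variant, assuming every title is written as a {\b ...} group with no nested braces inside it. The sample text extends the Wikipedia example with a second entry, and the function name split_on_bold_groups is made up for illustration.

```python
import re

# Sample RTF based on the example above, with two bold-titled entries.
rtf = (
    r"{\rtf1\ansi{\fonttbl\f0\fswiss Helvetica;}\f0\pard" "\n"
    r"{\b First title} body of the first entry.\par" "\n"
    r"{\b Second title} body of the second entry.\par" "\n"
    r"}"
)

# Matches "{\b " followed by brace-free text and the closing "}".
BOLD_GROUP = re.compile(r"\{\\b ([^{}]*)\}")


def split_on_bold_groups(text):
    """Return (title, body) pairs, one per bold group found in text."""
    matches = list(BOLD_GROUP.finditer(text))
    entries = []
    for i, match in enumerate(matches):
        # The body runs from the end of this title to the start of the next.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        entries.append((match.group(1), text[match.end():end]))
    return entries


for title, body in split_on_bold_groups(rtf):
    print(repr(title), repr(body))
```

The bodies still contain raw RTF control words such as \par, so they would need further cleanup before being treated as plain text.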

mike.k


RTF bold runs can also take the form `This is some \b bold\b0 text.`, so depending on what generated this particular RTF, you might miss some or all of the bold headers. – Eric Appelt Jun 07 '15 at 13:38
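To cover both spellings, one option is a small scanner that tracks bold state across control words and braces instead of matching a single pattern. The following is only a sketch: the function name bold_runs is made up, and it deliberately ignores hex escapes like \'e9, Unicode escapes, and special destinations, all of which a real RTF file may contain.

```python
import re

# Control word (with optional numeric parameter and one delimiter space),
# or a control symbol, or a brace, or a run of plain text.
TOKEN = re.compile(
    r"\\([a-zA-Z]+)(-?\d+)? ?|\\(.)|([{}])|([^\\{}]+)", re.DOTALL
)


def bold_runs(rtf):
    """Return the text of each bold run, for both {\\b ...} and \\b ...\\b0."""
    runs = []
    current = []   # fragments of the bold run being collected
    stack = []     # bold state saved at each "{"
    bold = False

    def flush():
        text = "".join(current)
        if text.strip():
            runs.append(text)
        current.clear()

    for match in TOKEN.finditer(rtf):
        word, param, symbol, brace, text = match.groups()
        if brace == "{":
            stack.append(bold)
        elif brace == "}":
            was_bold = bold
            bold = stack.pop() if stack else False
            if was_bold and not bold:
                flush()
        elif word == "b":
            now_bold = param != "0"   # \b or \b1 turns bold on, \b0 off
            if bold and not now_bold:
                flush()
            bold = now_bold
        elif symbol is not None and symbol in "{}\\" and bold:
            current.append(symbol)    # escaped literal brace or backslash
        elif text is not None and bold:
            # Raw line breaks in RTF source are not part of the text.
            current.append(text.replace("\r", "").replace("\n", ""))
    flush()
    return runs


print(bold_runs(r"This is some \b bold\b0 text."))
print(bold_runs(r"{\rtf1 This is some {\b bold} text.\par}"))
```

Each returned run marks where an entry title sits, so splitting the file at those positions becomes a matter of recording match offsets alongside the text.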
