I have a dataframe containing a column of HTML. Each entry in the column is a paragraph of HTML. For example:
html <- '<p id="PARA339" style="TEXT-ALIGN: left; MARGIN: 0pt; LINE-HEIGHT: 1.25"><font style="FONT-SIZE: 10pt; FONT-FAMILY: Times New Roman, Times, serif"><i>We had a net loss of $1.</i><i><b>55</b></i><i> million for the year ended December 31, 201</i><i>6</i><i> and have an accumulated deficit of $</i><i>61.5</i><i> million as of December 31, 201</i><i>6</i><i>. To achieve sustainable profitability, we must generate increased revenue.</i></font></p>'
I need to determine whether each paragraph of HTML is bold, italic, underlined, etc. Many of my paragraphs have some parts in bold and some parts not, like the one above (which is all italic, but only the number 55 is bold), so I'd apply a threshold rule: if, say, 50% or more of the text in a paragraph is bold, I'll flag the whole paragraph as bold. Under that rule, the example above would be flagged as italic but not as bold, since only the two characters "55" are bold.
I have no idea where to start. A good start would be to know which R package I should use (and, of course, if anyone can help me solve the problem using that package, that would be even better!). Thanks.