Which R package can parse a string of HTML to tell me which words of the text are bold, italic etc?

Question

I have a dataframe containing a column of HTML. Each entry in the column is a paragraph of HTML. For example:

html <- '<p id="PARA339" style="TEXT-ALIGN: left; MARGIN: 0pt; LINE-HEIGHT: 1.25"><font style="FONT-SIZE: 10pt; FONT-FAMILY: Times New Roman, Times, serif"><i>We had a net loss of $1.</i><i><b>55</b></i><i> million for the year ended December 31, 201</i><i>6</i><i> and have an accumulated deficit of $</i><i>61.5</i><i> million as of December 31, 201</i><i>6</i><i>. To achieve sustainable profitability, we must generate increased revenue.</i></font></p>'

I need to determine whether each paragraph of HTML is bold, italic, underlined, etc. Many of my paragraphs have some parts emboldened and some not, like the one above (which is all italic, but only the number 55 is bold), so I'd apply a rule: if, say, 50% or more of the text in the paragraph is emboldened, I'll flag the paragraph as bold.

I have no idea where to start. A good start would be to know which R package I should be trying to use (and, of course, if anyone can actually help me solve my problem using that package that would be even better!) Thanks

The `rvest` package is used for scraping, I'm guessing it can be used for this somehow. — r2evans, Mar 06 '21 at 22:28
`xml2` might have some power here, too, though I'm stretching on that one. They may be the only packages that are geared to parse html (I might be missing some). And before you or anybody else suggests regex, it is possible, but ... really discouraged (https://stackoverflow.com/a/1732454/3358272). — r2evans, Mar 06 '21 at 22:58

Waldi · Accepted Answer · 2021-03-07T14:29:22.900


You could use rvest and search for the b tag as explained here:

library(rvest)
html <- minimal_html("
  <ul>
    <li><b>C-3PO</b> is a <i>droid</i> that weighs <span class='weight'>167 kg</span></li>
    <li><b>R2-D2</b> is a <i>droid</i> that weighs <span class='weight'>96 kg</span></li>
    <li><b>Yoda</b> weighs <span class='weight'>66 kg</span></li>
    <li><b>R4-P17</b> is a <i>droid</i></li>
  </ul>
  ")

html %>% html_nodes("b")

{xml_nodeset (4)}
[1] <b>C-3PO</b>
[2] <b>R2-D2</b>
[3] <b>Yoda</b>
[4] <b>R4-P17</b>

Note that in rvest 0.3.6 the function is called html_nodes(), as used above. rvest 1.0.0 renames it to html_elements(); the old name still works there.
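Once the nodes are matched, html_text() returns the words themselves rather than the tags. The following is a minimal sketch on a shortened, made-up paragraph in the style of the one in the question (`para` is a hypothetical sample, not the asker's data). It only detects formatting expressed through tags such as `<b>`, `<strong>`, `<i>` or `<em>`; bold or italic set through a `style` attribute would need separate handling.

```r
library(rvest)

# Hypothetical sample paragraph, shortened from the style used in the question
para <- '<p><i>We had a net loss of $1.</i><i><b>55</b></i><i> million for the year.</i></p>'

doc <- read_html(para)

# Text of every node marked bold via <b> or <strong>
bold_text <- doc %>% html_nodes("b, strong") %>% html_text()

# Same idea for italics (<i> or <em>); one element per matched node
italic_text <- doc %>% html_nodes("i, em") %>% html_text()

bold_text
italic_text
```

Here `bold_text` holds only the bold fragment, while `italic_text` has one entry per `<i>` node, including the one that wraps the bold number.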

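To use this on a dataframe (here df, with the HTML in a column named raw):

library(purrr)
# Returns a list with one node set of <b> elements per row of df
purrr::pmap(df, ~ with(list(...), raw %>% read_html() %>% html_nodes("b")))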
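The 50% rule from the question could then be built on top of this. The sketch below is one possible approach, not a definitive implementation: `tag_share()` is a hypothetical helper, the data frame is invented for illustration, and the share is measured in characters rather than words. It assumes bold is marked with `<b>` or `<strong>` tags and that those tags are not nested inside each other (nested matches would be counted twice).

```r
library(rvest)

# Hypothetical helper: share of a paragraph's characters that sit inside
# the tags matched by `css` (default: <b> and <strong>).
# Assumes matched tags are not nested, otherwise text is double counted.
tag_share <- function(html, css = "b, strong") {
  doc <- read_html(html)
  total <- nchar(html_text(doc))
  if (total == 0) return(NA_real_)
  tagged <- doc %>% html_nodes(css) %>% html_text()
  sum(nchar(tagged)) / total
}

# Invented example data with the HTML in a column named `raw`
df <- data.frame(
  raw = c("<p><b>Risk factors</b></p>",
          "<p><i>We had a net loss of $1.</i><i><b>55</b></i><i> million.</i></p>"),
  stringsAsFactors = FALSE
)

# One share per row, then the 50% threshold from the question
df$bold_share <- vapply(df$raw, tag_share, numeric(1), USE.NAMES = FALSE)
df$is_bold <- df$bold_share >= 0.5
```

Passing a different selector, for example `css = "i, em"`, would give the italic share in the same way.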

edited Mar 07 '21 at 14:29

answered Mar 07 '21 at 09:51

Waldi


`html_nodes("b, strong")` – QHarr Mar 07 '21 at 13:21
Thanks both. Very helpful. How do I apply this to a dataframe column 'raw', to give me a column of emboldened words? `bolded <- filing_df %>% rowwise() %>% mutate(is_bold = read_html(raw) %>% html_nodes("b, strong")) ` isn't quite right. – ks123321 Mar 07 '21 at 13:44
see my edit, for a df. – Waldi Mar 07 '21 at 14:34
