How to read an HTML table and account for line breaks within cells

Question

I have an HTML table output from a program that separates values within a cell with <br>. I've tried using XML::readHTMLTable and htmltab, but they glom together the values without any separators. I need them to be comma-separated, but I don't see any arguments to those functions to account for this. I've posted a pseudo example of the file below. Currently it reads into two vectors, c("ABC","DEF","GHI") and c("JKLMNO","PQR","STU"), but I need the "JKLMNO" element to instead be "JKL,MNO".

<table>
  <tr>
    <td>
      ABC<br/>
    </td>
    <td>
      DEF<br/>
    </td>
    <td>
      GHI<br/>
    </td>
  </tr>
  <tr>
    <td>
      JKL<br/>
      MNO<br/>
    </td>
    <td>
      PQR<br/>
    </td>
    <td>
      STU<br/>
    </td>
  </tr>
</table>

Possible duplicate: http://stackoverflow.com/questions/1395528/scraping-html-tables-into-r-data-frames-using-the-xml-package — user5249203, Aug 01 '16 at 20:46

score 0 · Answer 1 · answered Jul 25 '20 at 17:38

I had this problem with '<br/>' in X being deleted by:

xTabs <- XML::readHTMLTable(X)

I fixed the problem as follows:

X1 <- gsub('<br/>', '\n', X)
xTabs <- XML::readHTMLTable(X1)

If I wanted '<br/>', I could then do a find and replace in xTabs. However, I'm happier with '\n'.
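The gsub call above matches only the exact string '<br/>'. A minimal sketch in base R, assuming the page may also contain '<br>', '<br />' or upper-case variants, and that a comma is the wanted separator (as in the question). br_to_sep and clean_cell are hypothetical helper names, not part of any package:

```r
# Hypothetical helper: turn every <br>, <br/>, <br /> (any case) into `sep`.
br_to_sep <- function(html, sep = ",") {
  gsub("<br\\s*/?>", sep, html, ignore.case = TRUE, perl = TRUE)
}

# Hypothetical helper: tidy a parsed cell, assuming "," is the separator.
clean_cell <- function(x) {
  x <- gsub("\\s*,\\s*", ",", trimws(x), perl = TRUE)  # drop whitespace around commas
  gsub("^,+|,+$", "", x)                               # drop leading/trailing commas
}

# Illustrative input only, not the OP's file.
X  <- "<table><tr><td>JKL<br/>MNO<br /></td><td>PQR<BR></td></tr></table>"
X1 <- br_to_sep(X)
X1
```

X1 could then be passed to XML::readHTMLTable(X1) as in this answer, with clean_cell applied to each column of the result to remove the trailing separators. Cells that legitimately contain commas would need a different separator.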

Spencer Graves

Sorry: I didn't notice that the "br" tags were suppressed. I had this problem with "<" "br" / ">" in X being deleted by XML::readHTMLTable(X). This was fixed by gsub, as I indicated. And I could have used gsub with xTabs if I had wanted to replace the '\n' everywhere with the br tag. – Spencer Graves Jul 25 '20 at 17:42

hrbrmstr · Accepted Answer · 2016-10-09T12:57:28.340

library(rvest)
library(dplyr)

doc <- read_html("<table>
  <tr>
    <td>
      ABC<br/>
    </td>
    <td>
      DEF<br/>
    </td>
    <td>
      GHI<br/>
    </td>
  </tr>
  <tr>
    <td>
      JKL<br/>
      MNO<br/>
    </td>
    <td>
      PQR<br/>
    </td>
    <td>
      STU<br/>
    </td>
  </tr>
</table>")

tab <- html_table(doc)[[1]] 

mutate(tab, X1=gsub("[\r\n][[:space:]]+", ",", X1))
##        X1  X2  X3
## 1     ABC DEF GHI
## 2 JKL,MNO PQR STU
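The mutate() call above cleans only X1. If multi-value cells can turn up in any column, the same substitution can be applied to every character column. A rough sketch in base R follows. The data frame here is a hand-built stand-in (an assumption) for what html_table() returned above, so the exact whitespace in it is illustrative:

```r
# Stand-in for the parsed table; the embedded newline mimics a cell whose
# values were separated by <br/> plus a line break in the source HTML.
tab <- data.frame(
  X1 = c("ABC", "JKL\n      MNO"),
  X2 = c("DEF", "PQR"),
  X3 = c("GHI", "STU"),
  stringsAsFactors = FALSE
)

# Apply the same pattern as the mutate() call to every character column.
is_chr <- vapply(tab, is.character, logical(1))
tab[is_chr] <- lapply(tab[is_chr], function(x) {
  gsub("[\r\n][[:space:]]+", ",", x)
})

tab
```

Like the original pattern, this relies on a line break followed by whitespace sitting between the values. It does nothing for cells where the values are separated by a br tag alone.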

UPDATE

For folks who have HTML in a different format and may not feel up to the strain of posting, if you had, say:

doc <- read_html("<table>
  <tr>
    <td>ABC<br/></td>
    <td>DEF<br/></td>
    <td>GHI<br/></td>
  </tr>
  <tr>
    <td>JKL<br/>MNO<br/></td>
    <td>PQR<br/></td>
    <td>STU<br/></td>
  </tr>
</table>")

the aforementioned solution won't work because it's not the same data the OP had. I know…it's shocking.

If that is the case, copying and pasting a solution is definitely easier than typing a new question and you can use the following:

library(rvest)
library(dplyr)
library(purrr)

map(1:3, function(col) {
  html_nodes(doc, xpath=sprintf(".//tr/td[%d]", col)) %>% 
  map_chr(~paste0(html_nodes(., xpath=".//text()"), collapse=","))
}) %>% 
  set_names(sprintf("X%d", 1:3)) %>% 
  as_tibble()  # as_data_frame() is deprecated in current dplyr/tibble

But — amazingly enough — if you had different tags and data in the TD tags or had to work with a more complex table structure, this solution would likely require adaptation as well. The mind boggles.
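For what it's worth, the text-node idea can be written so that it does not hard-code the number of columns and copes with both layouts (values separated by a br plus a newline, or by a br alone). This is only a sketch using xml2, which rvest builds on. It assumes a single plain table in which every row has the same number of td cells, and the V1..V3 column names are just what as.data.frame() assigns to an unnamed matrix:

```r
library(xml2)

# Illustrative input: values separated by <br/> only, no newlines.
doc <- read_html("<table>
  <tr><td>ABC<br/></td><td>DEF<br/></td><td>GHI<br/></td></tr>
  <tr><td>JKL<br/>MNO<br/></td><td>PQR<br/></td><td>STU<br/></td></tr>
</table>")

rows <- xml_find_all(doc, ".//tr")

# For each row, build one string per cell from that cell's text nodes.
cells <- lapply(rows, function(tr) {
  vapply(xml_find_all(tr, "./td"), function(td) {
    txt <- trimws(xml_text(xml_find_all(td, ".//text()")))
    paste(txt[nzchar(txt)], collapse = ",")  # skip whitespace-only nodes
  }, character(1))
})

# Assumes equal-length rows; ragged tables would need padding first.
tab <- as.data.frame(do.call(rbind, cells), stringsAsFactors = FALSE)
tab
```

Trimming each text node and dropping the empty ones is what lets the same code handle the indented layout from the question, where the whitespace between tags shows up as extra text nodes.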

While this does work as is, it doesn't actually really answer the question. The reason why JKL and MNO can be separated from each other is that they are separated by a newline (in addition to the br tag). The question was how to separate values separated by a br, and if there is only a br and no newline, this does not work. — Ilari Scheinin, Oct 09 '16 at 12:35
Apparently the OP thought differently. Feel free to post your own answer or start a new question if you have a similar situation this solution isn't working for. — hrbrmstr, Oct 09 '16 at 12:42
