Extract html table into R but skip specific elements (
) in one certain column that has three lines/rows, in a multi-page html

Question

I want to load the table in a link into a dataframe in R.

The following script successfully extracts the table:

install.packages("htmltab")
library(htmltabl)
url <- "http://www.hmdb.ca/metabolites?utf8=✓&filter=true&toxin=1&filter=true"
hm_ext <- htmltab(url)

But here is the problem:

> head(hm_ext)
     HMDB ID CAS Number        Name FormulaAverage Mass Monoisotopic Mass                                           Biospecimen Location
2    HMDB000014264-18-6 Formic acid              CH2O246.025446.005479308 BloodBreast MilkCerebrospinal Fluid (CSF)FecesSalivaSweatUrine
3   HMDB0000228108-95-2      Phenol              C6H6O94.111294.041864814                                     BloodFecesSalivaSweatUrine
4 HMDB000059818496-25-8     Sulfide                    S32.06531.97207069                                                          Blood
5  HMDB00005997440-47-3    Chromium                 Cr51.996151.940511904                      BloodCerebrospinal Fluid (CSF)SalivaUrine
6  HMDB00006577440-50-8      Copper                  Cu63.54662.929601079                      BloodCerebrospinal Fluid (CSF)SalivaUrine
7  HMDB00006627782-41-4    Fluoride                  F18.998418.998403205                                               BloodSalivaUrine

The third column, i.e. Formula Average Mass Monoisotopic Mass, has three rows/values/lines and they are all mixed together and thus showing up as one continuous string. I only want to extract the first line/row in this column OR somehow separate the three values from one another.

Here is how the first cell of the third column looks like in the rendered html page:

third column screenshot

Same thing happens if I use XML::readHTMLTable.

When I click on Inspect Element in Chrome, I can see such a structure for a cell in the Formula Average Mass Monoisotopic Mass column:

<td class="weight-value">CH<sub>2</sub>O<sub>2</sub><br><br>46.0254<br>46.005479308</td>

image -- please embed it. I do not have enough reputation to do that

However, there are also other times when the second and third lines in this column are empty. Example:

<td class="weight-value">(C<sub>12</sub>H<sub>19</sub>NO<sub>19</sub>S<sub>3</sub>)nH<sub>2</sub>O<br><br><span class="wishart wishart-not-available">Not Available</span><br></td>

image -- Inspect Element screenshot

So how can I extract the table from the given link but keep the structure of the third column readable and not mixed up? And furthermore, is it possible to extract the tables in all pages without looping through the links for each separate page?

I am not sure what exactly you need. But I think you want three seperate values for column no 3. Give us your expected output table to better answer this. However from first look I think you should look for `sub` command. See `?sub` and `?pattern` — SamAct, Jun 18 '18 at 09:22
If you take a look at the link using your browser, you will notice that column `FormulaAverage Mass Monoisotopic Mass` has exactly three values each in a separate line. When loading the html table into R, all of the lines get coerced into one string. — solmazutre, Jun 18 '18 at 09:33
What link? I can see two images which have more then 3 values. — SamAct, Jun 18 '18 at 09:36
I am interested in only keeping the string that is on the first line, but if keeping only the first one is not possible, then I want to have three but at least separated in a way that is human-readable and not all mixed up and confused. In the current state it is not possible to use `sub` because it is not possible to understand how the three lines are separated, i.e. they have no separator at all, the separator is `""` which makes it impossible to retrieve the original values. — solmazutre, Jun 18 '18 at 09:37
The link is in the first chunk of code in my post. `url <- "http://www.hmdb.ca/metabolites?utf8=✓&filter=true&toxin=1&filter=true"` — solmazutre, Jun 18 '18 at 09:38
You have this: `CH2O246.025446.005479308` you can use pattern as the `dot` and seperate them. — SamAct, Jun 18 '18 at 09:39
@SamAct No. That is incorrect. The `dot` comes from the decimal number on the second and third lines and there is no way to find out what the original value. Why? Here's an example: take `CH2O246.025446.005479308`, what was the value on the first line? Was it `CH2O`? Or was it `CH2O2`? Or was it `CH2O24`? `dot` is meaningless here, it belongs to the decimal number and is not a separator. — solmazutre, Jun 18 '18 at 10:35

score 1 · Accepted Answer · answered Jun 18 '18 at 13:30

One of the approach could be

library(rvest)
library(qdapRegex)
library(XML)

#read webpage
htm_data <- read_html("http://www.hmdb.ca/metabolites?tf8=%E2%9C%93&filter=true&toxin=1&filter=true") 

#convert above webpage's table into a dataframe
df <- html_table(html_nodes(htm_data, "table"))[[1]]

#cleanup data in the required column
df[, 4] <- unlist(lapply(rm_between(xml_find_all(htm_data, "//table/tbody/tr/td[4]"), 
                                    ">", 
                                    "<br><br>", extract=TRUE), 
                         function(x) gsub("<.*?>", "", x[[1]])))

which gives

> head(df)
    HMDB ID  CAS Number        Name Structure FormulaAverage Mass Monoisotopic Mass
1    HMDB000014264-18-6 Formic acid        NA                                 CH2O2
2   HMDB0000228108-95-2      Phenol        NA                                 C6H6O
3 HMDB000059818496-25-8     Sulfide        NA                                     S
4  HMDB00005997440-47-3    Chromium        NA                                    Cr
5  HMDB00006577440-50-8      Copper        NA                                    Cu
6  HMDB00006627782-41-4    Fluoride        NA                                     F
                                            Biospecimen Location
1 BloodBreast MilkCerebrospinal Fluid (CSF)FecesSalivaSweatUrine
2                                     BloodFecesSalivaSweatUrine
3                                                          Blood
4                      BloodCerebrospinal Fluid (CSF)SalivaUrine
5                      BloodCerebrospinal Fluid (CSF)SalivaUrine
6                                               BloodSalivaUrine

Very clever. Thanks a lot. I wouldn't have been able to figure this out in any way by myself. Great answer. Everything works perfectly. I only receive this warning: `In stringi::stri_extract_all_regex(text.var, pattern) : argument is not an atomic vector; coercing` but I assume it is nothing important, because I do not see anything wrong in the extracted data table. — solmazutre, Jun 19 '18 at 13:22
Glad that it helped! Btw I am not very sure about the cause of error message mentioned in your comment as I can't reproduce it but I guess you may want to try `stringi::stri_extract_all_regex(text.var[[1]], pattern)`. — Prem, Jun 19 '18 at 18:43

Extract html table into R but skip specific elements () in one certain column that has three lines/rows, in a multi-page html

1 Answers1

Extract html table into R but skip specific elements (
) in one certain column that has three lines/rows, in a multi-page html