
I'm having trouble with web scraping a table from ClinicalTrials.gov.

I'm trying to find a CSS selector that extracts the terms in the first column of the first row, labeled "breast cancer", of the Terms and Synonyms Searched table. Here is the link to the table: https://clinicaltrials.gov/ct2/results/details?cond=breast+cancer

The screenshot below shows the terms I want:

[Screenshot: the terms wanted from the Terms and Synonyms Searched table]

The CSS selector .w3-padding-8:nth-child(1) gets me all the terms in the first column. This works if the search term is a single word, like "pembrolizumab", but if the search term is two words, like "breast cancer", the table contains multiple groups of rows ("chunks"), and the selector returns the terms from every chunk.

EDIT: Here is the code, as @neilfws suggested:

library(rvest)  # also provides the %>% pipe used below

# replace every space (not just the first) with "+" to build the query string
search_term_processed <- stringr::str_replace_all("breast cancer", stringr::fixed(" "), "+")
ctgov_url <- paste0("https://clinicaltrials.gov/ct2/results/details?term=", search_term_processed)
ct_page <- xml2::read_html(ctgov_url)

# extract related terms
ct_page %>%
  # find elements that match a css selector
  rvest::html_elements(".w3-padding-8:nth-child(1)") %>%
  # retrieve text from element (html_text() is much faster than html_text2())
  rvest::html_text()

Does anyone know a CSS selector that extracts only the terms in the first column of the first chunk?

Howard Baek
    I think it would help to show some code, the output from it, and the desired output. When I use the selector from your question and pass it to `html_text()` I get one result, "Breast Neoplasms", which does not sound like what you describe. – neilfws Jul 26 '22 at 01:36

3 Answers


The td cells of class w3-padding-8 include both the synonyms listed in the column you want and the (unwanted) study counts for the search and for the entire database.

Because there are two cells containing study numbers following each synonym entry, the following strategy may help to isolate just the synonym column.

First, build a NodeList of all the td elements of class w3-padding-8:

const cells = document.querySelectorAll('td.w3-padding-8');

Then log the innerText of the first, fourth, seventh and so on cells (indices 0, 3, 6, ...), skipping those that contain study counts:

for (let i=0; i<cells.length; i+=3) {
  console.log(cells[i].innerText);
}

Note the use of i+=3 for the loop incrementor - allowing just cells 0,3,6... etc., containing the synonyms to be listed.

I ran this on the browser console with the link you provided loaded and it returned the list of synonyms. The only part you may have to tinker with is that the loaded table contained three sections: 'breast cancer', 'cancer' and 'breast', and the list contained the synonyms for all three sections. You should be able to isolate the 'breast cancer' block and apply the same idea to retrieve its synonym column.

The key appears to be skipping two cells after each synonym using i+=3.
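
The same stride-of-three idea can be sketched outside the browser. This is a minimal illustration in Python using only the standard library; the `SAMPLE` markup is a hypothetical, simplified stand-in for the results table (not the real page source), and `synonym_cells` is a name made up for this example. It assumes every row has exactly three `td.w3-padding-8` cells, as described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the results table (not the real page markup).
SAMPLE = """
<table>
  <tr><td class="w3-padding-8">breast cancer</td><td class="w3-padding-8">12,002 studies</td><td class="w3-padding-8">12,002 studies</td></tr>
  <tr><td class="w3-padding-8">Breast Neoplasms</td><td class="w3-padding-8">9,539 studies</td><td class="w3-padding-8">9,539 studies</td></tr>
  <tr><td class="w3-padding-8">Breast tumor</td><td class="w3-padding-8">159 studies</td><td class="w3-padding-8">159 studies</td></tr>
</table>
"""


def synonym_cells(markup, stride=3):
    """Return the text of every `stride`-th matching cell, starting at index 0."""
    root = ET.fromstring(markup.strip())
    # All td cells with the class, in document order (like querySelectorAll).
    cells = root.findall(".//td[@class='w3-padding-8']")
    # Keep cells 0, 3, 6, ... and skip the two study-count cells after each one.
    return [(cell.text or "").strip() for cell in cells[::stride]]


synonyms = synonym_cells(SAMPLE)
print(synonyms)
```

Like the console loop, this relies purely on cell position, so it would break if a row had a different number of matching cells.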

Dave Pritlove
  • Hmm I'm not familiar with the code you wrote. Was looking for a solution in R. Thanks though. – Howard Baek Jul 26 '22 at 13:31
  • Apologies. As far as CSS selectors go, you could start with the following (it isolates the synonyms, but the list will be 'contaminated' with the word 'Synonym' at the start and with irrelevant stuff after the list you want): `tr:not(:first-child) > td:nth-child(1)`. You may be able to refine it to remove the extra stuff, or just remove it manually in one go. – Dave Pritlove Jul 26 '22 at 23:07

Does it solve your problem to get the table instead?

library(tidyverse)
library(rvest)

"https://clinicaltrials.gov/ct2/results/details?cond=breast+cancer" %>% 
  read_html() %>% 
  html_table() %>% 
  .[[1]] 

# A tibble: 30 × 3
   Terms                   `Search Results*` `Entire Database**`
   <chr>                   <chr>             <chr>              
 1 Synonyms                Synonyms          Synonyms           
 2 breast cancer           12,002 studies    12,002 studies     
 3 Breast Neoplasms        9,539 studies     9,539 studies      
 4 breast carcinomas       917 studies       917 studies        
 5 Breast tumor            159 studies       159 studies        
 6 cancer of the breast    66 studies        66 studies         
 7 Neoplasm of breast      61 studies        61 studies         
 8 cancer of breast        40 studies        40 studies         
 9 Carcinoma of the Breast 33 studies        33 studies         
10 CARCINOMA OF BREAST     32 studies        32 studies         
# … with 20 more rows
# ℹ Use `print(n = ...)` to see more rows
Chamkrai
  • Not really. It's nice to have the table, but I just want the first "chunk" from this table, so everything from row 2 to row 19. I'd like a CSS selector to do this automatically. – Howard Baek Jul 26 '22 at 13:28

Problem:

I don't believe this is currently possible with rvest, as it relies on Selectors Level 3 under the hood, which permits only a simple selector inside the negation pseudo-class :not(). The rows are all at the same DOM level, and what you want is to filter out the later rows that follow the first "batch".

What would work, with selectors level 4 which permits selector lists inside of :not(), is:

tr[style]:nth-child(1) ~ tr:not(tr[style]:nth-child(n+2) ~ tr):not([style]) td:first-child

or

tr[style]:nth-child(1) ~ tr:not([style], tr[style]:nth-child(n+2) ~ tr) td:first-child

The above are just working examples for selectors level 4. There are other variants. In the above, a complex selector list is passed to :not() to filter out the next result and any subsequent sibling rows.

See selectors by level here.

I think this limitation is a deliberate consequence of how selectr is implemented under the hood. Within the source code, if :not is encountered, the next delimiter is expected to be ")". Anything that would form a complex selector or a selector list, such as the combinators ~ and +, is rejected. In the current source code you can view this detail on lines 510-528.

Compare the following:


parse_simple_selector, which is a selectr function, objects to the presence of a non-simple selector inside the pseudo class negation.

library(rvest)

link <- 'https://clinicaltrials.gov/ct2/results/details?cond=breast+cancer'
page <- read_html(link) 
selector_list <- 'tr[style]:nth-child(1) ~ tr:not(tr[style]:nth-child(n+2) ~ tr):not([style]) td:first-child'

page |> html_elements(selector_list)    # fail
page |> html_element('tr:not(td ~ td)') # fail 
page |> html_element('tr:not(td)')      # pass

Now look at soupsieve, a Python package used by Beautiful Soup 4, which:

provides selectors from the CSS level 1 specifications up through the latest CSS level 4 drafts and beyond (though some are not yet implemented).

(as of 28/7/22)

The implemented details from selectors level 4 allow for complex selector lists within :not()


import requests
from bs4 import BeautifulSoup as bs

selector = 'tr[style]:nth-child(1) ~ tr:not( tr[style]:nth-child(n+2) ~ tr):not([style]) td:first-child'
soup = bs(requests.get('https://clinicaltrials.gov/ct2/results/details?cond=breast+cancer').text, 'html.parser')
soup.select(selector)

Solution:

Some possible options might include:

  1. Implement your own helper(s) that extend selectr
  2. Utilise a loop that stops when it hits the next tr with a style attribute following finding the first with your target text
  3. Preferred solution, IMO, switch to xpath e.g.

(//tr[count(preceding-sibling::tr[@style])=1 and count(following-sibling::tr[@style])>=2])//td[1]

In code

page |> html_elements(xpath = '(//tr[count(preceding-sibling::tr[@style])=1 and count(following-sibling::tr[@style])>=2])//td[1]') |> html_text(trim = TRUE)

Now, these are positional matchings so you might decide to improve by using some text based matching involving your search term.
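
For comparison, option 2 (a loop that starts at the row holding the target text and stops at the next tr with a style attribute) could be sketched roughly as below. This is only an illustration in Python with the standard library: `SAMPLE` is a hypothetical, cut-down table that assumes each chunk is preceded by a styled `tr`, as the selectors above imply, and `chunk_terms` is a made-up helper name. The style value is a placeholder, not the page's real one.

```python
import xml.etree.ElementTree as ET

# Hypothetical, cut-down table: a styled row precedes each chunk (assumption).
SAMPLE = """
<table>
  <tr style="placeholder"><td>Synonyms</td><td>Synonyms</td><td>Synonyms</td></tr>
  <tr><td>breast cancer</td><td>12,002 studies</td><td>12,002 studies</td></tr>
  <tr><td>Breast Neoplasms</td><td>9,539 studies</td><td>9,539 studies</td></tr>
  <tr style="placeholder"><td>Synonyms</td><td>Synonyms</td><td>Synonyms</td></tr>
  <tr><td>cancer</td><td>100 studies</td><td>100 studies</td></tr>
  <tr><td>Neoplasms</td><td>90 studies</td><td>90 studies</td></tr>
</table>
"""


def chunk_terms(markup, search_term):
    """Collect first-column terms from the chunk that starts with search_term."""
    terms = []
    collecting = False
    for row in ET.fromstring(markup.strip()).iter("tr"):
        if row.get("style") is not None:
            if collecting:
                break  # next styled row reached: the wanted chunk has ended
            continue  # styled separator before the wanted chunk: skip it
        cell = row.find("td")  # first td child of the row
        text = (cell.text or "").strip() if cell is not None else ""
        # Start collecting at the row whose first cell is the search term.
        if not collecting and text.lower() == search_term.lower():
            collecting = True
        if collecting and text:
            terms.append(text)
    return terms


first_chunk = chunk_terms(SAMPLE, "breast cancer")
print(first_chunk)
```

Because the loop keys on the search term rather than on row position, it does not depend on how many chunks follow, which is the kind of text-based matching suggested above.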


The idea of using count within xpath I got from @Daniel Haley here

QHarr