I'm using the R programming language. I'm hoping to find and make bold a series of four letters (amino acids, if you're curious) in a large html table of letters. I want to do this through html table navigation. If I were using regex on a normal string of letters, it would be "([KR].[ST][ILV])". This would find the letters RSSI or KATV, for instance. Unfortunately, the actual string I'm looking for would look something like this:

<center><table class="sequence-table"><tr><th align="left">
<tr>
<td bgcolor="lightgreen"><tt>R</tt></td>
<td bgcolor=""><tt>S</tt></td>
<td bgcolor="pink"><tt>S</tt></td>
<td bgcolor=""><tt>I</tt></td>

The end result I want is this:

<center><table class="sequence-table"><tr><th align="left">
<tr>
<td bgcolor="lightgreen"><tt><b>R</b></tt></td>
<td bgcolor=""><tt><b>S</b></tt></td>
<td bgcolor="pink"><tt><b>S</b></tt></td>
<td bgcolor=""><tt><b>I</b></tt></td>

I've written a monster-sized regex to find this sequence (attached below), but it doesn't seem to work. And I realize now that I should be using html commands, but I'm having a good deal of trouble finding websites that tell me how to search-and-replace. What should I be searching for? And/or how would I accomplish what I've described above?

Here is the regex. I realize now that I was approaching the problem from the wrong direction.

`regexp <- '(
[\\<<td bgcolor=""><tt>K</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>K</tt></td>\\>
\\<<td bgcolor=""><tt>R</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>R</tt></td>\\>]
[\\<<td bgcolor=""><tt>.</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>.</tt></td>\\>]
[\\<<td bgcolor=""><tt>S</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>S</tt></td>\\>
\\<<td bgcolor=""><tt>T</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>T</tt></td>\\>]
[\\<<td bgcolor=""><tt>I</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>I</tt></td>\\>
\\<<td bgcolor=""><tt>L</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>L</tt></td>\\>
\\<<td bgcolor=""><tt>V</tt></td>\\>
\\<<td bgcolor="\\w+"><tt>V</tt></td>\\>])'
`
  • Didn't you ask the [same question](https://stackoverflow.com/questions/45575344/how-to-find-and-bold-a-series-of-four-letters-in-an-html-table-this-post-has-be) 2 hours ago? – Steven Beaupré Aug 08 '17 at 20:37
  • I edited it to change the focus of the question. I wasn't sure what to do about the label marking it as a duplicate, since it wasn't referring to the question I have now asked. I'd be interested in hearing whatever I'm supposed to do, though! – Robin Rounthwaite Aug 08 '17 at 20:46

1 Answer

Maybe try this approach instead of regular expressions:

library(xml2)
library(tidyverse)
txt <- '<center><table class="sequence-table"><tr><th align="left">
<tr>
<td bgcolor="lightgreen"><tt>R</tt></td>
<td bgcolor=""><tt>S</tt></td>
<td bgcolor="pink"><tt>S</tt></td>
<td bgcolor=""><tt>I</tt></td>' 
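needles <- c("RSSI", "KMSV")  # four-letter rows to make bold
doc <- read_html(txt)
doc %>% 
  xml_find_all("//tr") %>%                                  # every table row
  keep(xml_text(.) %in% gsub("(.)", "\\1\n", needles)) %>%  # rows whose text is a needle, one letter per line
  xml_find_all("td/tt/text()") %>%                          # text nodes of the cells in those rows
  xml_add_parent("b")                                       # wrap each text node in <b>; doc is modified in place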
write_html(doc, tf <- tempfile(fileext = ".html"))
browseURL(tf) # open the temp file in the default browser; shell.exec(tf) works only on Windows

This wraps the text of each cell in <b>...</b> for every row that matches one of the needles, and saves the result to a temporary file.

cat(as.character(doc))
# ...
# <center><table class="sequence-table">
# <tr><th align="left">
# </th></tr>
# <tr>
# <td bgcolor="lightgreen"><tt><b>R</b></tt></td>
# <td bgcolor=""><tt><b>S</b></tt></td>
# <td bgcolor="pink"><tt><b>S</b></tt></td>
# <td bgcolor=""><tt><b>I</b></tt></td>
# ...
lukeA
  • This is great, thanks! I'm still a little stuck, though, on how I could make it more generalized. For instance, I want my program to find code that also finds K M S V. (Note the change of both bgcolor and the letter string) – Robin Rounthwaite Aug 08 '17 at 21:08
  • But why wouldn't you want to [use regex to parse HTML??](https://stackoverflow.com/a/1732454/903061) (+1, good answer) – Gregor Thomas Aug 08 '17 at 21:14
  • @RobinRounthwaite Hm so you want only to fatten specific x-letter-rows? See my edit and the new 'needles' variable. Maybe there are better/more efficient ways to do it. Thx Gregor btw. :) – lukeA Aug 08 '17 at 21:59
  • What package is your keep from? – Robin Rounthwaite Aug 09 '17 at 03:22
  • @RobinRounthwaite It's from `purrr` or rather the `tidyverse` - I replaced `magrittr` in my post. – lukeA Aug 09 '17 at 08:14
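
The generalization asked about in the comments (any match of the question's pattern `[KR].[ST][ILV]`, whatever the `bgcolor`) might be sketched along the following lines. This is not the answer author's method, only one possible variation on it. It assumes every `<td><tt>` cell holds exactly one letter and that all cells, read in document order, form one continuous sequence. The sample input and the names `cells`, `letters_seq` and `hits` are made up for illustration.

```r
library(xml2)

# Hypothetical input: single-letter cells as in the question, with the
# motif K M S V sitting between two unrelated residues.
txt <- '<table class="sequence-table"><tr>
<td bgcolor="lightgreen"><tt>A</tt></td>
<td bgcolor=""><tt>K</tt></td>
<td bgcolor="pink"><tt>M</tt></td>
<td bgcolor=""><tt>S</tt></td>
<td bgcolor=""><tt>V</tt></td>
<td bgcolor=""><tt>G</tt></td>
</tr></table>'

doc <- read_html(txt)

# One <tt> node per letter, in document order.
cells <- xml_find_all(doc, "//td/tt")

# Collapse the cell letters into a plain string so an ordinary regex applies.
# Assumption: one character per cell, so string position i is cell i.
letters_seq <- paste(trimws(xml_text(cells)), collapse = "")

# Start positions and lengths of all (non-overlapping) motif matches.
m <- gregexpr("[KR].[ST][ILV]", letters_seq)[[1]]

hits <- integer(0)
if (m[1] != -1) {
  # Expand each match into the cell indices it covers.
  hits <- unlist(Map(function(start, len) seq(start, length.out = len),
                     as.integer(m), attr(m, "match.length")))
}

if (length(hits) > 0) {
  # Wrap the text node of every matched cell in <b>; doc is modified in place.
  xml_add_parent(xml_find_all(cells[hits], "text()"), "b")
}

cat(as.character(doc))
```

Because the pattern is applied to the letters only, the `bgcolor` attributes never enter the match. If motifs should not span row boundaries, the same steps could be run once per `<tr>` instead of once for the whole document.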