I want to remove all subscripts from a piece of HTML, except the subscript “rep”.

For instance, the string "t<sub>i</sub>(10) = 23, p<sub>rep</sub>=.2" should become: "t(10) = 23, p<sub>rep</sub>=.2"

I was trying things like:

txt <- "t<sub>i</sub>(10) = 23, p<sub>rep</sub>=.2"
gsub(pattern="<sub>(?!rep).*</sub>",replacement="",txt,perl=TRUE)

But the problem is that this line of code deletes everything between the first <sub> and the last </sub> in the html file...

Michele

2 Answers


Use the XML library to parse the HTML. You can select the nodes you want to remove with an XPath expression and pass them to removeNodes:

library(XML)
xData <- htmlParse("t<sub>i</sub>(10) = 23, p<sub>rep</sub>=.2")
# select every <sub> node whose text does not contain "rep"
remNodes <- xData['//sub[not(contains(., "rep"))]']
removeNodes(remNodes)
xData
# <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
#   <html><body>t(10) = 23, p<sub>rep</sub>=.2</body></html>
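
One place where the parser approach tends to be more robust than a pattern is markup that is not byte-for-byte what the pattern expects. The sketch below uses a hypothetical input in which one subscript carries a class attribute (the attribute and the variable names are assumptions for illustration, not part of the question). It also swaps contains() for an exact comparison, on the assumption that only a subscript whose whole text is "rep" should be kept.

```r
library(XML)

# Hypothetical input: the first <sub> has an attribute
html <- 't<sub class="idx">i</sub>(10) = 23, p<sub>rep</sub>=.2'

# A literal "<sub>" in a regex does not match '<sub class="idx">',
# so the attributed subscript is left in place
by_regex <- gsub("<sub>(?!rep).*?</sub>", "", html, perl = TRUE)

# The parser matches on the element name, whatever attributes it has
doc <- htmlParse(html, asText = TRUE)
# keep only subscripts whose text is exactly "rep" (assumed intent)
drop <- getNodeSet(doc, '//sub[not(normalize-space(.) = "rep")]')
removeNodes(drop)

# serialize the modified document back to a character string
by_parser <- saveXML(doc)
```

Here by_regex still contains the attributed subscript, while by_parser keeps only the "rep" one. Note that saveXML returns the whole document, including the html and body wrappers the parser adds.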
jdharrison
  • thanks! but why is using a parser better than just the non greedy regex? – Michele Jul 04 '14 at 14:16
  • At this point someone will normally point you to http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – jdharrison Jul 04 '14 at 14:18
  • oh dear, i have " given in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane". for now, the regex seems to work fine, but i'll see if i can change my regex into a parser. thanks! – Michele Jul 04 '14 at 14:27

It is recommended to use a parser when dealing with HTML, but to explain your problem:

The issue is that .* is greedy: it first consumes everything up to the end of the string and then backtracks just far enough for the closing tag to match. The first position where that succeeds is the last </sub> in the string, so the match spans from the first <sub> to the final </sub>.

The simple fix is to follow .* with ? to make it non-greedy (lazy). .*? still matches any character except a newline, zero or more times, but it takes as few characters as possible, so the match stops at the first closing tag after each <sub>.

txt <- 't<sub>i</sub>(10) = 23, p<sub>rep</sub>=.2'
gsub('<sub>(?!rep).*?</sub>', '', txt, perl=TRUE)
# [1] "t(10) = 23, p<sub>rep</sub>=.2"
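
To make the difference concrete, here is a small side-by-side sketch of the greedy and lazy patterns on the string from the question. The last two calls go a bit beyond the original question: they assume a hypothetical second input, txt2, to show that (?!rep) also protects any subscript that merely starts with "rep", and that moving the closing tag into the lookahead restricts the exception to exactly "rep".

```r
txt <- "t<sub>i</sub>(10) = 23, p<sub>rep</sub>=.2"

# Greedy: .* runs to the end of the string and backtracks to the LAST </sub>
greedy <- gsub("<sub>(?!rep).*</sub>", "", txt, perl = TRUE)
# "t=.2"

# Lazy: .*? stops at the FIRST </sub> after each opening tag
lazy <- gsub("<sub>(?!rep).*?</sub>", "", txt, perl = TRUE)
# "t(10) = 23, p<sub>rep</sub>=.2"

# Hypothetical input with a subscript that only starts with "rep"
txt2 <- "a<sub>reps</sub> b<sub>rep</sub>"

# (?!rep) skips both subscripts, so nothing is removed
loose <- gsub("<sub>(?!rep).*?</sub>", "", txt2, perl = TRUE)
# "a<sub>reps</sub> b<sub>rep</sub>"

# (?!rep</sub>) only skips a subscript that is exactly "rep"
strict <- gsub("<sub>(?!rep</sub>).*?</sub>", "", txt2, perl = TRUE)
# "a b<sub>rep</sub>"
```

Whether the stricter lookahead is wanted depends on the data; if no other subscript begins with "rep", the two patterns behave the same.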
hwnd