regex: match all subscripts in an html file except a specific one

Question

I want to remove all subscripts from a piece of html code, except the subscript “rep”.

For instance, the string "t<sub>i</sub>(10) = 23, p<sub>rep</sub>=.2" should become: "t(10) = 23, p<sub>rep</sub>=.2"

I was trying things like:

txt <- "t<sub>i</sub>(10) = 23, p<sub>rep</sub>=.2"
gsub(pattern="<sub>(?!rep).*</sub>",replacement="",txt,perl=TRUE)

But the problem is that this line of code deletes everything between the first <sub> and the last </sub> in the html file...

Your pattern is greedy. Make it ungreedy by adding a `?` after the `.*`, like so: `"<sub>(?!rep).*?</sub>"`. Better yet, use a real HTML parser to achieve this. – Amal Murali, Jul 03 '14 at 13:32

score 1 · Answer 1 · answered Jul 03 '14 at 13:42


Use the XML library to parse the html. You can select the nodes you want to remove and use removeNodes:

library(XML)
xData <- htmlParse("t<sub>i</sub>(10) = 23, p<sub>rep</sub>=.2")
# Select every <sub> node whose text does not contain "rep"
remNodes <- xData['//sub[not(contains(., "rep"))]']
removeNodes(remNodes)
xData
# <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# <html><body>t(10) = 23, p<sub>rep</sub>=.2</body></html>
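
Note that contains(., "rep") keeps any subscript whose text merely contains "rep", for example "prep". If only the exact subscript "rep" should survive (an assumption about the goal), a variant along these lines may work. It is a sketch that assumes the XML package is installed; the test string and the names doc, toRemove and remaining are made up for illustration:

```r
library(XML)

# Hypothetical input with an extra subscript that merely contains "rep"
txt <- 't<sub>i</sub>(10) = 23, p<sub>rep</sub>=.2, q<sub>prep</sub>'
doc <- htmlParse(txt, asText = TRUE)

# Select <sub> nodes whose full text is anything other than exactly "rep"
toRemove <- getNodeSet(doc, '//sub[not(. = "rep")]')
removeNodes(toRemove)

# Text of the subscripts that are still in the document
remaining <- sapply(getNodeSet(doc, '//sub'), xmlValue)
print(remaining)
```

With the original contains() test, the "prep" subscript would have been kept as well.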


jdharrison


thanks! but why is using a parser better than just the non greedy regex? – Michele Jul 04 '14 at 14:16
At this point someone will normally point you to http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – jdharrison Jul 04 '14 at 14:18
oh dear, i have "given in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane". for now, the regex seems to work fine, but i'll see if i can change my regex into a parser. thanks! – Michele Jul 04 '14 at 14:27

hwnd · Accepted Answer · 2014-07-03T14:16:17.813

It is recommended to use a parser when dealing with HTML, but to explain your problem...

The issue is that .* is greedy: it first consumes everything up to the end of the string, then backtracks one character at a time until the closing tag can match. The first position where that succeeds is the last </sub> in the string, so the match runs from the first <sub> to the final </sub>.
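
To see exactly what the greedy pattern consumes, the match can be extracted instead of replaced. This is a small sketch using base R's regexpr() and regmatches(); the variable name greedy is only illustrative:

```r
txt <- 't<sub>i</sub>(10) = 23, p<sub>rep</sub>=.2'

# Greedy .* runs to the end of the string, then backtracks to the LAST </sub>
greedy <- regmatches(txt, regexpr('<sub>(?!rep).*</sub>', txt, perl = TRUE))
print(greedy)
# [1] "<sub>i</sub>(10) = 23, p<sub>rep</sub>"
```

The lookahead (?!rep) is only checked right after the first <sub>, which is followed by "i", so it does nothing to protect the later "rep" subscript once .* has swallowed it.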

The simple fix is to follow .* with ? to make it lazy (non-greedy). .*? still matches any character except a newline, zero or more times, but it takes as few characters as possible, so the match stops at the first closing tag it reaches instead of the last.

txt <- 't<sub>i</sub>(10) = 23, p<sub>rep</sub>=.2'
gsub('<sub>(?!rep).*?</sub>', '', txt, perl=TRUE)
# [1] "t(10) = 23, p<sub>rep</sub>=.2"
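
One caveat: (?!rep) only checks that the subscript does not start with "rep", so a subscript such as "repeat" would also be kept. If the intent is to keep only the exact subscript "rep" (an assumption about the goal), the lookahead can include the closing tag. A minimal sketch, with a made-up test string txt2:

```r
# Hypothetical input: "repeat" starts with "rep" but is not the exact subscript
txt2 <- 'a<sub>repeat</sub> b<sub>rep</sub> c<sub>i</sub>'

# Original lookahead: keeps every subscript that starts with "rep"
loose <- gsub('<sub>(?!rep).*?</sub>', '', txt2, perl = TRUE)

# Stricter lookahead: keeps only a subscript that is exactly "rep"
strict <- gsub('<sub>(?!rep</sub>).*?</sub>', '', txt2, perl = TRUE)

print(loose)
# [1] "a<sub>repeat</sub> b<sub>rep</sub> c"
print(strict)
# [1] "a b<sub>rep</sub> c"
```

Both patterns still assume plain <sub> tags with no attributes and no line breaks inside the subscript, which is one reason a parser is the more robust choice.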
