
I have a portion of HTML code in R like the one below:

"</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"group.php?g=1\">XXXX</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"category.php?c=100050\">YYYY</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"category.php?c=100050&brand=Motorola\">ZZZZ</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\">AAAA"

I want to use gsub to remove the unwanted HTML code so that the output will be:

XXXX YYYY ZZZZ AAAA

I tried the pattern <([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1> as shown here, but it fails. Why?

How can I do it in R? Thanks.

lokheart
  • It might be cleaner to extract the names from the HTML using the `XML` package and XPath queries. If you post a link to the webpage containing the HTML, there are many on SO who would be able to give you pointers on how to extract the desired information. – Ramnath Aug 14 '11 at 14:33
  • Be careful... http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Iterator Aug 14 '11 at 16:55
  • Should this question and the other be merged? http://stackoverflow.com/questions/7057374/remove-anything-within-a-pair-of-parenthesis-using-gsub-in-r – Iterator Aug 14 '11 at 19:13
  • Possible duplicate of [Removing html tags from a string in R](https://stackoverflow.com/questions/17227294/removing-html-tags-from-a-string-in-r) – divibisan Apr 08 '19 at 19:48
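
The `XML` package and XPath route suggested in the comments above might look roughly like the sketch below. It assumes the `XML` package is installed; `extract_text` is a hypothetical helper name, and the string is the one from the question, rebuilt with `paste0` to keep the lines short. How the parser handles the stray leading `</a>` and the whitespace between tags may vary, so the sketch drops empty text nodes rather than relying on a particular layout.

```r
# The question's string, rebuilt piece by piece to keep lines short.
arrow <- '<img src="images/arrow_orange.gif" width="8" height="12">'
x <- paste0(
  "</a> ", arrow,
  ' <a href="group.php?g=1">XXXX</a> ', arrow,
  ' <a href="category.php?c=100050">YYYY</a> ', arrow,
  ' <a href="category.php?c=100050&brand=Motorola">ZZZZ</a> ', arrow,
  "AAAA"
)

# Hypothetical helper: parse the fragment as HTML and return its text nodes.
extract_text <- function(html) {
  doc <- XML::htmlParse(html, asText = TRUE)               # lenient HTML parser
  txt <- XML::xpathSApply(doc, "//text()", XML::xmlValue)  # every text node
  txt <- trimws(as.character(unlist(txt)))                 # strip padding
  txt[nzchar(txt)]                                         # drop empty nodes
}

# Only run the example when the XML package is available.
if (requireNamespace("XML", quietly = TRUE)) {
  print(paste(extract_text(x), collapse = " "))
}
```

Selecting `//text()` rather than `//a` is deliberate: the trailing AAAA is not inside a link, so an anchor-only query would miss it.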

1 Answer


I suggest you heed the warnings of @Ramnath and @Iterator and use a parser instead, but here is the best I can do with your string and regex:

(First, add the missing `</a>` to the end of your input string.)

x <- "</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"group.php?g=1\">XXXX</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"category.php?c=100050\">YYYY</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"category.php?c=100050&brand=Motorola\">ZZZZ</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\">AAAA</a>"

The code:

# Drop each opening tag but keep the text that follows it (group 3).
# Closing tags such as </a> are not matched, because "/" is not a letter.
x1 <- gsub("<([[:alpha:]][[:alnum:]]*)(.[^>]*)>([^<]*)", "\\3", x)
x1
[1] "</a>  XXXX</a>  YYYY</a>  ZZZZ</a> AAAA</a>"

# Remove the remaining closing tags.
gsub("</a>", "", x1)
[1] "  XXXX  YYYY  ZZZZ AAAA"
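
If the goal is simply to drop every tag, a single pattern that matches anything between `<` and `>` may be enough, followed by a whitespace clean-up. This is only a sketch: it assumes no attribute value contains a literal `>` and that the input has no comments or scripts, which is exactly where a parser is the safer choice. The string is the one from the question, rebuilt with `paste0` to keep the lines short.

```r
# The question's string, rebuilt piece by piece to keep lines short.
arrow <- '<img src="images/arrow_orange.gif" width="8" height="12">'
x <- paste0(
  "</a> ", arrow,
  ' <a href="group.php?g=1">XXXX</a> ', arrow,
  ' <a href="category.php?c=100050">YYYY</a> ', arrow,
  ' <a href="category.php?c=100050&brand=Motorola">ZZZZ</a> ', arrow,
  "AAAA"
)

# Remove every tag, opening or closing: "<", one or more non-">" chars, ">".
no_tags <- gsub("<[^>]+>", "", x)

# Collapse runs of whitespace into one space and trim both ends.
result <- trimws(gsub("[[:space:]]+", " ", no_tags))
print(result)
# [1] "XXXX YYYY ZZZZ AAAA"
```

Because closing tags are matched too, no trailing `</a>` has to be added to the input first.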
Andrie
  • No `perl = TRUE`? I always feel I'm living dangerously if I don't use that in my R regex functions. – Iterator Aug 14 '11 at 19:12
  • Sadly I'm not of the perl generation, so I always use `perl=FALSE`. Personal preference, I imagine... – Andrie Aug 14 '11 at 20:18
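
On the question's "why?", here is one plausible reading, run with `perl = TRUE` as discussed in the comments above. The question does not show the actual `gsub` call, so the replacement `"\\2"` is an assumption. Note that backslashes have to be doubled inside an R string literal.

```r
# The question's string, rebuilt piece by piece to keep lines short.
arrow <- '<img src="images/arrow_orange.gif" width="8" height="12">'
x <- paste0(
  "</a> ", arrow,
  ' <a href="group.php?g=1">XXXX</a> ', arrow,
  ' <a href="category.php?c=100050">YYYY</a> ', arrow,
  ' <a href="category.php?c=100050&brand=Motorola">ZZZZ</a> ', arrow,
  "AAAA"
)

# The question's pattern, with backslashes doubled for R.
pat <- "<([A-Z][A-Z0-9]*)\\b[^>]*>(.*?)</\\1>"

# As written, [A-Z] only matches upper-case tag names; the tags here are
# lower-case, so nothing is replaced.
unchanged <- gsub(pat, "\\2", x, perl = TRUE)
print(identical(unchanged, x))                  # TRUE

# Ignoring case lets each <a ...>text</a> pair match, but the pattern
# requires a closing tag and <img> has none, so the images survive.
partial <- gsub(pat, "\\2", x, perl = TRUE, ignore.case = TRUE)
print(grepl("<a href", partial, fixed = TRUE))  # FALSE
print(grepl("<img", partial, fixed = TRUE))     # TRUE
```

So two things appear to get in the way: case sensitivity, and the fact that the pattern only handles paired tags.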