0

I'm working on a corpus of email messages and trying to replace every HTML tag in the corpus with the string '<html>'. How can I replace all HTML tags, using the fact that they begin with < and end with >?

Example:

<html>
  <body>
  This is some random text.
 <p>This is some text in a paragraph.</p>
</body>
</html>

Should be translated to:

<html>
  <html>
  This is some random text.
    <html>This is some text in a paragraph.<html>
  <html>
<html>

Thanks

Yoav
  • 2
    If these emails are html then you are much better off using the `XML` package. What would you want to do with, e.g., `Link`? Return just `Link`? – jlhoward Jun 06 '14 at 19:50
  • I actually want to record that there were HTML tags in the document; I just want to generalize them to one single tag. In your example I would like it to be Link – Yoav Jun 06 '14 at 19:53
  • 1
    Please provide an example of (at least) one of these emails. – jlhoward Jun 06 '14 at 19:55
  • 1
    @Yoav when adding new information, please edit your original question rather than responding in comments. That way the information can be properly formatted and you can make it clear what you want. – MrFlick Jun 06 '14 at 20:04
  • @Yoav What would be the result of processing this email? – jlhoward Jun 06 '14 at 20:06
  • @jlhoward Please refer to the original question, I just edited it and gave an example of the output. Thanks! – Yoav Jun 06 '14 at 20:18
  • Sure hope none of your emails have any loveicons, `<3` , or even worse, winking eyes, ` >.<` – Carl Witthoft Jun 06 '14 at 21:42

2 Answers

2

You can use a regular expression with gsub. If you simply want to replace any <markup_name> with <html>, then `gsub("<[^>]+>", "<html>", email_text)` will do it.

The trick is `[^>]+`: `[^>]` matches any character that is not >, and the `+` extends the match up to the first >.
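
As a minimal sketch of the call above, here it is applied to the sample message from the question. The variable names `generalized`, `hits` and `n_tags` are only illustrative, and the tag count is an optional extra in case you also want to know how many tags a message contained.

```r
# Sample email body taken from the question
email_text <- "<html>
  <body>
  This is some random text.
  <p>This is some text in a paragraph.</p>
</body>
</html>"

# "<[^>]+>" matches "<", then one or more characters that are not ">", then ">"
generalized <- gsub("<[^>]+>", "<html>", email_text)
cat(generalized, "\n")

# Optional: count the matched tags (gregexpr returns -1 when nothing matches)
hits <- gregexpr("<[^>]+>", email_text)[[1]]
n_tags <- if (hits[1] == -1) 0L else length(hits)
n_tags
```

Note that this pattern treats anything between < and > as a tag, so stray angle brackets in ordinary text can also be replaced.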

Math
  • 1
    When I try this I get an error that `\+` is an unrecognized escape character. If I remove the back-slash it works, e.g. `gsub("<[^>]+>", "", email_text)` – jlhoward Jun 06 '14 at 20:31
  • Sorry, I mixed with `\\w` and sed that needs \. Thanks for the correction, I edited the post. – Math Jun 06 '14 at 20:33
1

Here's another method, offered only for completeness, since it is less general than @Math's solution, which I consider superior. I was thinking that one might also use the range-quantifier pattern operator {n,m}. It probably has many deficiencies. It also brings to mind a famous SO answer: RegEx match open tags except XHTML self-contained tags

 dat <- "<html>
   <body>
   This is some random text.
  <p>This is some text in a paragraph.</p>
 </body>
 </html>"

 gsub("<.{1,5}>", "<html>", dat)
#[1] "<html>\n  <html>\n  This is some random text.\n <html>This is some text in a paragraph.<html>\n<html>\n<html>"

> cat( gsub("<.{1,5}>", "<html>", dat) )
<html>
  <html>
  This is some random text.
 <html>This is some text in a paragraph.<html>
<html>
<html>
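
One of those deficiencies can be shown with a short sketch: the `{1,5}` quantifier only matches tags whose contents are at most five characters long, so longer tag names are left untouched. The input string `dat2` and the result names below are made up for illustration.

```r
# Hypothetical input with one long tag name and one short one
dat2 <- "<blockquote>Quoted text</blockquote> and <b>bold</b>"

# Range-quantifier pattern: tags with more than 5 characters inside are skipped
short_only <- gsub("<.{1,5}>", "<html>", dat2)
print(short_only)

# The more general pattern from the other answer replaces every tag
general <- gsub("<[^>]+>", "<html>", dat2)
print(general)
```

Raising the upper bound would catch longer tags, but since `.` can also match `>`, a wide range risks swallowing adjacent tags as one match.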
IRTFM