
I have the following string:

string = "asflkjsdhlkjsdhglk<body>Iwant\to+extr@ctth!sstr|ng<body>sdgdfsghsghsgh"

I would like to extract the substring delimited by the two <body> tags, including the tags themselves. The result I am looking for is:

substring = "<body>Iwant\to+extr@ctth!sstr|ng<body>"

Note that the substring between the two <body> tags can contain letters, numbers, punctuation and special characters.

Is there an easy way of doing this?

halfer
Mayou

4 Answers


Here is the regular expression way:

regmatches(string, regexpr('<body>.+<body>', string))
Matthew Plourde
regex = '<body>.+?<body>'

You want the non-greedy quantifier (.+?), so that the match stops at the first closing <body> instead of extending to the last <body> in the string.

If you're using the regex on its own, with no auxiliary functions, you will need a capturing group to extract what is required, i.e.:

regex = '(<body>.+?<body>)'
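The difference between the greedy and non-greedy forms only shows up when the input contains more than two <body> markers. Here is a minimal sketch, assuming a made-up input `s` with two tagged spans (the string and the variable names are illustrative, not from the question):

```r
# Hypothetical input: two separate <body>...<body> spans
s <- "aa<body>first<body>bb<body>second<body>cc"

# Greedy: .+ runs on to the last <body> in the string
greedy <- regmatches(s, regexpr("<body>.+<body>", s))

# Non-greedy: .+? stops at the first closing <body>
lazy <- regmatches(s, regexpr("<body>.+?<body>", s))

# Capturing group: regexec() returns the full match followed by each group
groups <- regmatches(s, regexec("(<body>.+?<body>)", s))[[1]]

print(greedy)
print(lazy)
print(groups[2])
```

For the string in the question, which has exactly two <body> markers, both forms should return the same substring.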
Steve P.

strsplit() should help you:

> string = "asflkjsdhlkjsdhglk<body>Iwant\to+extr@ctth!sstr|ng<body>sdgdfsghsghsgh"
> x = strsplit(string, '<body>', fixed = FALSE, perl = FALSE, useBytes = FALSE)
> x
[[1]]
[1] "asflkjsdhlkjsdhglk"         "Iwant\to+extr@ctth!sstr|ng" "sdgdfsghsghsgh"
> x[[1]][2]
[1] "Iwant\to+extr@ctth!sstr|ng"

Of course, this gives you all three parts of the string and does not include the tag.
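If the tags are needed in the result, one option is to paste them back around the middle piece. This is only a sketch: the names `tag`, `parts` and `substring_with_tags` are illustrative, and it assumes the string contains exactly one tagged span.

```r
string <- "asflkjsdhlkjsdhglk<body>Iwant\to+extr@ctth!sstr|ng<body>sdgdfsghsghsgh"
tag <- "<body>"

# fixed = TRUE treats the tag as a literal string rather than a regex
parts <- strsplit(string, tag, fixed = TRUE)[[1]]

# The second piece is the text between the tags; re-attach the tags around it
substring_with_tags <- paste0(tag, parts[2], tag)
print(substring_with_tags)
```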

Stu
  • Thank you very much. But the body tags in your solution get excluded. I want them to be returned as well. – Mayou Nov 26 '13 at 18:13

I believe that Matthew's and Steve's answers are both acceptable. Here is another solution:

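string = "asflkjsdhlkjsdhglk<body>Iwant\to+extr@ctth!sstr|ng<body>sdgdfsghsghsgh"

# Matthew's approach, for comparison
regmatches(string, regexpr('<body>.+<body>', string))

# Alternative: capture the tagged span and replace the whole string with it
output = sub(".*(<body>.+<body>).*", "\\1", string)

print(output)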
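One difference between the two approaches is what happens when the input has no match. A short sketch, using a made-up input `no_tags` (the variable names are illustrative):

```r
# Hypothetical input that contains no <body> markers
no_tags <- "no markers in this string"

# sub() leaves the input unchanged when the pattern does not match
via_sub <- sub(".*(<body>.+<body>).*", "\\1", no_tags)

# regmatches() with regexpr() drops non-matching elements, giving character(0)
via_regmatches <- regmatches(no_tags, regexpr("<body>.+<body>", no_tags))

print(via_sub)
print(length(via_regmatches))
```

So with sub() it may be worth checking that the result differs from the input before treating it as an extracted substring.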
Gardener