Extract a substring between two words from a string

Question

I have the following string:

string = "asflkjsdhlkjsdhglk<body>Iwant\to+extr@ctth!sstr|ng<body>sdgdfsghsghsgh"

I would like to extract the substring delimited by the two <body> tags, including the tags themselves. The result I am looking for is:

substring = "<body>Iwant\to+extr@ctth!sstr|ng<body>"

Note that the substring between the two <body> tags can contain letters, numbers, punctuation and special characters.

Is there an easy way of doing this?

Matthew Plourde · Answer 1 · 2013-11-26T18:19:37.497

7

Here is the regular expression way:

regmatches(string, regexpr('<body>.+<body>', string))

edited Nov 26 '13 at 18:19

answered Nov 26 '13 at 18:08

Matthew Plourde


Why do you need the perl = TRUE in this? – TheComeOnMan Nov 26 '13 at 18:12
@Codoremifa you don't, thanks. Originally, I thought OP wanted to exclude the tags, to which I suggested using lookahead assertions, requiring the `perl=TRUE` flag. – Matthew Plourde Nov 26 '13 at 18:19

One advantage of `perl=TRUE` is that [it's faster](http://stackoverflow.com/questions/17757534/when-does-setting-perl-true-in-strsplit-does-not-work-as-intended-or-at-all). – Arun Nov 27 '13 at 23:26
@Arun no kidding. Thanks, I wasn't aware of this. – Matthew Plourde Nov 27 '13 at 23:42
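For readers who want the text without the tags, the lookaround variant mentioned in the comments above might look like the sketch below. The pattern and the variable name `inner` are illustrative assumptions, not the code that was originally posted. Lookbehind and lookahead are PCRE features, which is why `perl = TRUE` is needed here.

```r
string <- "asflkjsdhlkjsdhglk<body>Iwant\to+extr@ctth!sstr|ng<body>sdgdfsghsghsgh"

# (?<=<body>) asserts that "<body>" precedes the match;
# (?=<body>) asserts that "<body>" follows it.
# Neither assertion consumes characters, so the tags stay out of the result.
inner <- regmatches(string, regexpr("(?<=<body>).+(?=<body>)", string, perl = TRUE))
```

Dropping the two assertions and matching the tags directly gives back the tag-inclusive result the question asks for.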

Steve P. · Answer 2 · 2013-11-28T00:37:20.613

6

regex = '<body>.+?<body>'

You want the non-greedy quantifier (.+?), so that the match stops at the first closing <body> rather than running on to the last <body> in the string.

If you are using only a regex, with no auxiliary function to pull out the match, you will need a capturing group to extract what is required, i.e.:

regex = '(<body>.+?<body>)'
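The difference between the greedy and the non-greedy quantifier only shows up when the input has more than two <body> markers. Here is a minimal sketch in R, assuming a hypothetical three-marker string `s` that is not part of the question. `perl = TRUE` is set only so that the lazy quantifier follows PCRE semantics, and the `sub()` call is one possible way to use the capturing group.

```r
# Hypothetical input with three <body> markers, so the two quantifiers differ.
s <- "xx<body>first<body>second<body>yy"

# Greedy: .+ runs on to the last <body> in the string.
greedy <- regmatches(s, regexpr("<body>.+<body>", s, perl = TRUE))

# Lazy: .+? stops at the nearest following <body>.
lazy <- regmatches(s, regexpr("<body>.+?<body>", s, perl = TRUE))

# Capturing group plus backreference: replace the whole string with group 1.
captured <- sub(".*?(<body>.+?<body>).*", "\\1", s, perl = TRUE)
```

On the two-marker string from the question both quantifiers return the same substring, so the choice matters only if more markers can appear.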

edited Nov 28 '13 at 00:37

answered Nov 26 '13 at 18:16

Steve P.


score 2 · Answer 3 · answered Nov 26 '13 at 18:05

2

strsplit() should help you:

> string = "asflkjsdhlkjsdhglk<body>Iwant\to+extr@ctth!sstr|ng<body>sdgdfsghsghsgh"
> x = strsplit(string, '<body>', fixed = FALSE, perl = FALSE, useBytes = FALSE)
> x
[[1]]
[1] "asflkjsdhlkjsdhglk"         "Iwant\to+extr@ctth!sstr|ng" "sdgdfsghsghsgh"  
> x[[1]][2]
[1] "Iwant\to+extr@ctth!sstr|ng"

Of course, this returns all three parts of the string, and none of them includes the <body> tags.
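If the tags are needed in the result, one option is to paste them back around the middle piece. This is a sketch rather than part of the original answer; the names `tag`, `parts` and `with_tags` are made up for illustration, and it assumes the string contains exactly two markers.

```r
string <- "asflkjsdhlkjsdhglk<body>Iwant\to+extr@ctth!sstr|ng<body>sdgdfsghsghsgh"
tag <- "<body>"

# fixed = TRUE treats the tag as a literal string rather than a regex.
parts <- strsplit(string, tag, fixed = TRUE)[[1]]

# parts[2] is the text between the first and second tag; wrap it again.
with_tags <- paste0(tag, parts[2], tag)
```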

answered Nov 26 '13 at 18:05

Stu


Thank you very much. But the body tags in your solution get excluded. I want them to be returned as well. – Mayou Nov 26 '13 at 18:13

score 0 · Answer 4 · answered Aug 27 '17 at 07:00

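I believe that Matthew's and Steve's answers are both acceptable. Here is another solution, using sub() with a capturing group and a backreference (the regmatches() line repeats Matthew's approach for comparison):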

string = "asflkjsdhlkjsdhglk<body>Iwant\to+extr@ctth!sstr|ng<body>sdgdfsghsghsgh"

regmatches(string, regexpr('<body>.+<body>', string))

output = sub(".*(<body>.+<body>).*", "\\1", string)

print (output)
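One difference between the two approaches is worth checking before choosing: when the pattern does not match, `sub()` returns the input unchanged, whereas `regmatches()` with `regexpr()` returns an empty character vector. A small sketch, using a hypothetical input `no_tags` that contains no <body> marker:

```r
# Hypothetical input without any <body> marker.
no_tags <- "sdgdfsghsghsgh"

# sub() finds nothing to replace, so the original string comes back.
sub_result <- sub(".*(<body>.+<body>).*", "\\1", no_tags)

# regexpr() reports no match, so regmatches() yields character(0).
match_result <- regmatches(no_tags, regexpr("<body>.+<body>", no_tags))
```

With `sub()`, a failed match therefore has to be detected separately, for example by comparing the result with the input.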
