Let's say I wanted to extract a string found between two defined strings. For example, a function, which we'll call parse_between(), would work as follows in R:

> main_string <- "the quick brown fox>$ jumps over the lazy </ dog"
> substring <- parse_between(main_string, begin=">$", end="</")
> substring
[1] " jumps over the lazy "

Even better if it could produce a vector with elements corresponding to each instance. I've searched some of the packages available for string manipulation, like "stringr", but have not found a function that does this as easily as the example shows. My motivation is to parse HTML files; unfortunately, despite searching, I haven't found an HTML parser for R.

Carl Witthoft
iantist

1 Answer

First off, read this question & answer very carefully: RegEx match open tags except XHTML self-contained tags

Then, if still undeterred, use regular expressions (see ?regex) with sub or gsub. The regex syntax has metacharacters specifying the beginning ("^") and end ("$") of a line. What you could do, then, is replace

{start_of_line through to ">$"} 

with nothing, then replace

{"</" through to end_of_line}

with nothing.
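
The two replacements described above can be sketched in R roughly as follows. This is a minimal sketch, not a robust HTML parser. The variable names tail_part and result are made up for illustration, and parse_between() is a hypothetical helper (only its name comes from the question) that goes a step beyond the two-step approach: it uses gregexpr() and regmatches() to return every match as a vector. It assumes the delimiters do not contain "\E", because they are quoted literally with PCRE's \Q...\E.

```r
main_string <- "the quick brown fox>$ jumps over the lazy </ dog"

# Step 1: drop everything from the start of the line through ">$".
# "$" is a regex metacharacter, so it is escaped as "\\$".
# ".*?" is lazy (needs perl = TRUE), so it stops at the first ">$".
tail_part <- sub("^.*?>\\$", "", main_string, perl = TRUE)

# Step 2: drop everything from "</" through the end of the line.
result <- sub("</.*$", "", tail_part)
print(result)  # [1] " jumps over the lazy "

# Hypothetical helper: return every substring found between begin and end.
parse_between <- function(x, begin, end) {
  # \Q...\E makes PCRE treat the delimiters as literal text.
  pattern <- paste0("\\Q", begin, "\\E.*?\\Q", end, "\\E")
  hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
  if (length(hits) == 0) return(character(0))
  # Each hit still includes the delimiters; trim them off both ends.
  substr(hits, nchar(begin) + 1, nchar(hits) - nchar(end))
}

print(parse_between(main_string, begin = ">$", end = "</"))
```

The helper takes a single string; for a character vector of lines, something like lapply(lines, parse_between, begin = ">$", end = "</") would be one way to apply it to each element.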

Carl Witthoft
  • Thought I could avoid actually learning regex but doesn't look like it. I guess it's good for me anyhow. Thanks for the laugh. – iantist Feb 12 '13 at 16:33
  • OK, so to delete the first part one would use `uptostring<-sub(x=main_string,pattern=".x",replace="")` and then the second would be something similar. This video [link](http://www.youtube.com/watch?v=NvHjYOilOf8) was very helpful in understanding regular expressions. Is there a way to say up to but not including a character, or starting from but not including a character? – iantist Feb 12 '13 at 18:54
  • Yes: basically you tell regex to start at the beginning of the line ("^") and accept any characters so long as they are not the one you want to exclude (for this example, a "K"), using a negated character class: `^[^K]*`. If that doesn't work, blame me :-) and dig up a regex cheat sheet. – Carl Witthoft Feb 12 '13 at 19:35
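
To illustrate the "up to but not including" idea from the last comment, here is a small sketch using a negated character class. The sample string and the variable names x, before_k, and after_k are assumptions chosen for the example, not something from the thread.

```r
x <- "abcKdef"

# Everything up to, but not including, the first "K":
# [^K]* matches a run of characters that are not "K".
before_k <- regmatches(x, regexpr("^[^K]*", x))

# Everything after, but not including, the first "K":
# remove the leading non-"K" run together with the "K" itself.
after_k <- sub("^[^K]*K", "", x)

print(before_k)
print(after_k)
```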