I have a long HTML string with the following properties:

Length: 1
Class and mode: character

......uygdasd class="vip" title="Click this link to access The Big Bang Theory: The Complete Fourth Season (DVD, 2011, 3-Disc Set).....

Is it possible to extract part of that string based on the text around it? I want to drop everything up to and including `class="vip" title="Click this link to access` and everything from `(DVD, 2011,` onward, so that the result is:

The Big Bang Theory: The Complete Fourth Season

Thanks for any help.

Dima Sukhorukov
  • I think the questioner has difficulty with English and actually means "extract" (== "retain") rather than "subtract" (=="remove"). – IRTFM Apr 29 '15 at 20:13
  • Is the pattern always "Click to access... things you want... (Extra stuff)" ? – rawr Apr 29 '15 at 20:15
  • @BondedDust I want to remove everything before `class="vip" title="Click this link to access` and everything after `(DVD, 2011,`, and leave only `The Big Bang Theory: The Complete Fourth Season`. Sorry for my bad English. – Dima Sukhorukov Apr 29 '15 at 20:20
  • @rawr yes "Click to access... things you want... (Extra stuff)" is a pattern – Dima Sukhorukov Apr 29 '15 at 20:22
  • Don't grep html... use `rvest` to parse it. – cory Apr 29 '15 at 20:26
  • @cory Using `rvest` gave me a different problem that I haven't found a solution for yet: http://stackoverflow.com/questions/29929363/r-rvest-for-and-error-server-error-503-service-unavailable – Dima Sukhorukov Apr 29 '15 at 20:29

1 Answer

Use the grouping operators `()`. The pattern below matches the whole string, and the replacement `\\2` keeps only the second group, so everything up to and including "link to access " and everything from " (DVD," onward is thrown away. The expression `.+` means "one or more of any character". See the `?regex` help page for further details about the interpretation of `^` and `$` and the use of `\\N` in replacements:

    htxt <- 'uygdasd class="vip" title="Click this link to access The Big Bang Theory: The Complete Fourth Season (DVD, 2011, 3-Disc Set).....'

    gsub(pattern = "^(.+link to access )(.+)( \\(DVD,.+$)", replacement = "\\2", x = htxt)
    [1] "The Big Bang Theory: The Complete Fourth Season"
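
To see what each group captures, the same pattern can be passed to `regexec()`. The sketch below also shows one possible alternative that uses Perl-style lookarounds to pull out only the text between the two markers. The names `pattern`, `groups`, and `title` are made up for illustration, and the lookaround version assumes the markers "link to access " and " (DVD," each appear in the string.

```r
htxt <- 'uygdasd class="vip" title="Click this link to access The Big Bang Theory: The Complete Fourth Season (DVD, 2011, 3-Disc Set).....'

pattern <- "^(.+link to access )(.+)( \\(DVD,.+$)"

# regexec() returns the full match followed by one entry per capture group,
# so element 1 is the whole match and element 3 is the second group.
groups <- regmatches(htxt, regexec(pattern, htxt))[[1]]
groups[3]
#> [1] "The Big Bang Theory: The Complete Fourth Season"

# Alternative sketch: a lookbehind and a lookahead (perl = TRUE) match only
# the text between the two markers, so no replacement step is needed.
m <- regexpr("(?<=link to access ).+?(?= \\(DVD,)", htxt, perl = TRUE)
title <- regmatches(htxt, m)
title
#> [1] "The Big Bang Theory: The Complete Fourth Season"
```

Because the pattern is anchored with `^` and `$`, it can match at most once per string, so `sub()` should give the same result as `gsub()` here.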

There is, of course, the famous, highly voted answer to the question "RegEx match open tags except XHTML self-contained tags", which argues against parsing HTML with regular expressions.
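
For completeness, a parser-based route along the lines suggested in the comments might look like the sketch below. It assumes the `rvest` package is installed and that the fragment comes from an `<a>` element with class "vip". The surrounding markup is not shown in the question, so the `html` string and the `a.vip` selector here are hypothetical.

```r
library(rvest)

# Hypothetical complete element; the question shows only a fragment of it.
html <- '<a class="vip" title="Click this link to access The Big Bang Theory: The Complete Fourth Season (DVD, 2011, 3-Disc Set)" href="#">item</a>'

page <- read_html(html)

# Pull the title attribute from every element matching the CSS selector.
# (Older rvest versions use html_nodes() in place of html_elements().)
titles <- html_attr(html_elements(page, "a.vip"), "title")

# Strip the fixed prefix and the trailing "(DVD, ...)" part.
clean <- sub("^Click this link to access ", "", titles)
clean <- sub(" \\(DVD,.*$", "", clean)
clean
#> [1] "The Big Bang Theory: The Complete Fourth Season"
```

The parser handles attribute quoting and nesting, which leaves the regular expressions with the much simpler job of trimming a known prefix and suffix.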

IRTFM