Basically, I want to strip the text inside blockquote tags from an HTML document. I'm new to regular expressions, and even after experimenting on Rubular I'm no closer to an answer.

Any help is appreciated.

sent-hil

3 Answers

Use an HTML parser and forget regular expressions. Regex is incapable of correctly handling HTML.

require "nokogiri"

doc = Nokogiri::HTML(your_html)
doc.xpath("//blockquote").remove  # drops every <blockquote> together with its contents

From: Strip text from HTML document using Ruby

There are more examples of how to use Nokogiri and XPath, if you look around.

Tomalak

A rough example:

/<blockquote>([^<]*)<\/blockquote>/
Oleg Razgulyaev
  • This fails for `<blockquote>Some <b>bold</b> text</blockquote>`. As I said: Regex is *technically incapable* of correctly handling HTML. – Tomalak Apr 19 '10 at 08:02
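
To make the objection concrete, here is a minimal Ruby sketch of where the `[^<]*` pattern stops working. The sample strings and variable names are assumptions made up for illustration; they do not come from the thread.

```ruby
# Pattern from the answer: a blockquote whose body contains no "<" at all.
pattern = /<blockquote>([^<]*)<\/blockquote>/

# Hypothetical inputs, one without and one with markup inside the quote.
plain  = "<p>Intro</p><blockquote>Hello world</blockquote><p>Outro</p>"
nested = "<p>Intro</p><blockquote>Some <b>bold</b> text</blockquote><p>Outro</p>"

# No inner tags: the pattern matches and gsub removes the whole quote.
plain_result = plain.gsub(pattern, "")

# [^<]* stops at the "<" of "<b>", so nothing matches and the
# blockquote is left in place.
nested_result = nested.gsub(pattern, "")

puts plain_result
puts nested_result
```

So the pattern is only safe when you can guarantee the quoted text never contains another tag, which is rarely true of real HTML.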
Sample string:

<blockquote>Hello world</blockquote>

Type the following regex into Rubular: <blockquote>(.+?)</blockquote>

Or, for something more generic:

<.*?>(.+?)</.*?>
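
For what it's worth, here is a rough Ruby sketch of how the non-greedy blockquote pattern behaves. The input strings and variable names are assumptions chosen for illustration, and the closing slash is escaped because the pattern is written as a Ruby regex literal.

```ruby
# Non-greedy version of the blockquote pattern.
pattern = /<blockquote>(.+?)<\/blockquote>/

# Hypothetical inputs: a flat quote, a nested quote, and a multi-line quote.
flat      = "a<blockquote>Hello world</blockquote>b"
nested    = "a<blockquote>Some <blockquote>quoted text</blockquote> within a quote.</blockquote>b"
multiline = "a<blockquote>line one\nline two</blockquote>b"

# Flat quote: the whole element is removed.
flat_result = flat.gsub(pattern, "")

# Nested quote: the lazy match ends at the FIRST closing tag, so the
# tail of the outer quote and its closing tag are left behind.
nested_result = nested.gsub(pattern, "")

# Without the /m flag "." does not match a newline, so a quote that
# spans two lines is not matched at all.
multiline_result = multiline.gsub(pattern, "")

puts flat_result
puts nested_result
puts multiline_result
```

The flat case works, but the other two show why the pattern needs inputs that are known to be simple.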

Hope it helps!

Paul
  • This fails for `<blockquote>Some <blockquote>quoted text</blockquote> within a quote.</blockquote>`. – Tomalak Apr 19 '10 at 12:16
  • If we are just talking Ruby: `resultarray = htmlstring.split(/<.*?>/)`. The split() method discards the regex matches and keeps the text between them. FYI: the scan() method does the opposite and returns the matches themselves. If you're a newb, I suggest spending some time learning regexes; they're pretty language-agnostic and will serve you well. – Paul Apr 19 '10 at 17:28
  • If this comment was for me: No, I'm not a "newb" as far as regular expressions go. ;) And `htmlstring.split(/<.*?>/)` fails for `Don't do HTML with RegEx`. – Tomalak Apr 19 '10 at 18:56
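
The split() and scan() behaviour discussed in these comments can be sketched as follows. The input strings and variable names are assumptions for illustration, and the second input is only one possible way the tag pattern can go wrong, not necessarily the case the commenter had in mind.

```ruby
tag = /<.*?>/

# Hypothetical well-behaved input.
html = "<p>Hello <b>world</b></p>"

# split keeps the text between matches; the leading empty string comes
# from the tag at position 0, and trailing empty strings are dropped.
pieces = html.split(tag)

# scan returns the matches themselves, i.e. the tags.
tags = html.scan(tag)

# Hypothetical input with ">" inside an attribute value: the lazy match
# ends at the first ">", so part of the attribute leaks into the text.
tricky = '<a title="a > b">link</a>'
broken = tricky.split(tag)

p pieces
p tags
p broken
```

Note that split() keeps all of the text, including whatever sits inside a blockquote, so on its own it does not answer the original question either.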