1

I have a general idea of how I can do this, but can't pinpoint how exactly to get it done. I am sure it can be done with a regex of some sort. Wondering if anyone here can point me in the right direction.

If I have a string of HTML such as this:

some_html = '<div><b>This is some BOLD text</b></div>'

I want to divide it into logical pieces and then put those pieces into an array, so that I end up with a result like this:

html_array = ["<div>", "<b>", "This is some BOLD text", "</b>", "</div>"]
Spencer Cooley
  • Is it always tag tag text tag tag? – m0skit0 Oct 25 '11 at 08:06
  • no. I just used a simple example. The html is being stored in a db for a blog post. The blog post is being made with a rich text editor, so the html is just being generated depending on what the user inputs. I need to process the stored html so I can append it back into my rich text editor (contentEditable div) when the user wants to edit the post. – Spencer Cooley Oct 25 '11 at 08:13

3 Answers

5

Rather than use a regex, I'd use the Nokogiri gem (an HTML-parsing gem written by Aaron Patterson, a contributor to Rails and Ruby). Here's a sample of how to use it:

require 'nokogiri'
html_doc = Nokogiri::HTML("<html><body><h1>Mr. Belvedere Fan Club</h1></body></html>")

You can then call html_doc.children to get a NodeSet and work your way from there:

html_doc.children  # returns a nodeset
Dty
  • Cool. I will give it a try. Can you feed any html string in the Nokogiri::HTML method? Or does it have to be a whole html document? – Spencer Cooley Oct 25 '11 at 08:16
  • Ok thanks a lot. I am going to dig through the docs for a little bit and try this out. – Spencer Cooley Oct 25 '11 at 08:19
  • This doesn't actually answer the question. – pguardiario Oct 25 '11 at 08:56
  • @pguardiario you're right that I don't answer the question in the *title*. But the author's question states he's unsure of how to solve this problem and he's looking for help to point him in the *right* direction. That's what I tried to answer. – Dty Oct 25 '11 at 15:44
  • Hmm, nokogiri is the right tool for parsing html, but it won't help him split it. – pguardiario Oct 26 '11 at 23:49
4

Use an HTML parser, for instance, Nokogiri. Using SAX you can add tags/elements to the array as events are triggered.

It's not a good idea to try to regex HTML, unless you're planning to treat only a small determined subset of it.

Xavi López
0
some_html.split(/(<[^>]*>)/).reject(&:empty?)
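
This works because String#split keeps whatever a capture group matches in its result, so the tags survive the split; the empty strings that appear between adjacent tags are then rejected. Here is a minimal sketch that wraps the line above in a method (`split_html_tags` is a hypothetical name), assuming no attribute value, comment, or script body contains a literal `>`:

```ruby
# Hypothetical helper around the split shown above.
def split_html_tags(html)
  # The parentheses form a capture group, so split returns the
  # matched tags as well as the text between them.
  html.split(/(<[^>]*>)/).reject(&:empty?)
end

some_html  = '<div><b>This is some BOLD text</b></div>'
html_array = split_html_tags(some_html)
# html_array => ["<div>", "<b>", "This is some BOLD text", "</b>", "</div>"]
```

Unlike a parser, this keeps each tag exactly as written, attributes included, but it does nothing to validate or repair malformed markup.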
pguardiario