-5

I have an HTML response body as a string. The HTML contains these two fragments:

<h2><a href="javascript:;" class="user-name-class">MY_USER_NAME<b></b></a></h2>

["media_detail","init",[false,"",null,true,1,4,"99999_XXXXX_99999",11836530,"00076f7474727febc37a8825d373a5be","\/p\/LdvJWSF-6b\/","\/accounts\/login\/"]],

From these I need to extract MY_USER_NAME and 99999_XXXXX_99999.

I would appreciate help from regex rockstars. This is in Ruby 1.9.3. Thanks.

UPDATE: We are using regex because this is not done in realtime, so performance is not a concern.

kapso
  • 4
    I wouldn't use regex for this. Use an HTML/XML parser. – Jordan Kaye Nov 02 '12 at 20:10
  • 1
    http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Gus Nov 02 '12 at 20:19
  • 1
    This is not a question of performance. Regular expressions are simply **unable to** parse HTML correctly, let alone invalid HTML, which a DOM parser can cope with. – Martin Ender Nov 02 '12 at 20:54

3 Answers

3

The first one is HTML, so you should parse it with an HTML parser, and the other is JSON, so you could use a JSON library. Don't use regex. It's evil.
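For the JSON part, a minimal sketch of that idea follows. It assumes the fragment appears in the body exactly as quoted in the question, that it ends at the first `]]`, and that the wanted value is always the seventh element of the inner array. The sample body and the `extract_media_id` helper are hypothetical names for illustration. A regex is still used to locate the fragment, but the parsing itself is left to Ruby's standard `json` library. The HTML part would need an HTML parser gem such as Nokogiri, which is not shown here.

```ruby
require 'json'

# Hypothetical sample standing in for the real response body.
html_body = <<'HTML'
<script>
["media_detail","init",[false,"",null,true,1,4,"99999_XXXXX_99999",11836530,"00076f7474727febc37a8825d373a5be","\/p\/LdvJWSF-6b\/","\/accounts\/login\/"]],
</script>
HTML

# Returns the seventh element of the inner array, or nil if anything is missing.
def extract_media_id(body)
  # Locate the array literal; assumes it ends at the first "]]".
  fragment = body[/\["media_detail".*?\]\]/m]
  return nil unless fragment

  data = JSON.parse(fragment)
  args = data[2]
  # Index 6 is an assumption taken from the sample in the question.
  args.is_a?(Array) ? args[6] : nil
rescue JSON::ParserError
  nil
end

media_id = extract_media_id(html_body)
puts media_id
```

If the page changes the array layout, the helper returns nil instead of raising, so the caller can decide what to do.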

Community
d33tah
0

If you don't want to use HTML/JSON libraries, you can get the first one with:

str.gsub!(/<.*?>/, '')

To regex the second one you're going to have to tell us more about the format of the string... what's consistent, what isn't, etc.
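As a rough illustration of the tag-stripping approach, the sketch below first isolates the line that carries the class name and then strips the tags from that line only, since stripping tags from the whole body would leave all of the page text behind. The sample body is hypothetical, and the code assumes the `<h2>` element sits on a single line. It uses the non-mutating `gsub`, because `gsub!` returns nil when nothing is replaced.

```ruby
# Hypothetical sample body; only the <h2> line comes from the question.
html_body = <<'HTML'
<div class="header">
<h2><a href="javascript:;" class="user-name-class">MY_USER_NAME<b></b></a></h2>
</div>
HTML

# Pick the one line that carries the class name
# (assumes the element sits on a single line).
line = html_body.lines.find { |l| l.include?('user-name-class') }

# Non-mutating gsub always returns a String, even when nothing is replaced.
user_name = line ? line.gsub(/<.*?>/, '').strip : nil
puts user_name

# gsub! returns nil when it makes no substitution,
# so its return value is unsafe to chain on.
no_tags = 'plain text'.dup.gsub!(/<.*?>/, '')
puts no_tags.inspect
```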

Philip Hallstrom
0

You can use

s.split(/"user-name-class">/)[1].split(/</)[0]

(see this demo)

and

s.split(/\["media_detail"/)[1].split(/\[/)[1].split(/"?,"?/)[6]

(see this demo)

Ωmega
  • So I have this big HTML body/string, how do I apply this regex to the whole html content? – kapso Nov 02 '12 at 21:10
  • @user310525 - Just put the entire HTML content into the string `s`. Did you check the demo links? – Ωmega Nov 02 '12 at 21:20
  • I get this error: undefined method `split' for nil:NilClass. This is what I tried: html_body.split(/"user-name-class">/)[1].split(/</)[0] – kapso Nov 02 '12 at 21:50
  • @user310525 - You got that error, because first `split(/"user-name-class">/)` failed, so there is no `[1]` element, which means your string `html_body` does not contain **"user-name-class">** inside of it. Print your `html_body` and you will see... – Ωmega Nov 02 '12 at 21:53
  • Sorry, false alarm; yes, that works. Also, is there a way to avoid using the methods split and []? I was hoping to avoid nil errors in case the pattern changes. – kapso Nov 02 '12 at 22:44
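One possible way to address that last comment, offered as a sketch rather than a reply from the thread: `String#[]` accepts a regex and a capture group index and returns nil when the pattern does not match, so no chained `split` call can blow up. The helper names and the sample body are hypothetical, and the second pattern assumes the first six array elements contain no commas or closing brackets, as in the sample from the question.

```ruby
# Hypothetical sample body built from the two fragments in the question.
html_body = <<'HTML'
<h2><a href="javascript:;" class="user-name-class">MY_USER_NAME<b></b></a></h2>
["media_detail","init",[false,"",null,true,1,4,"99999_XXXXX_99999",11836530,"00076f7474727febc37a8825d373a5be","\/p\/LdvJWSF-6b\/","\/accounts\/login\/"]],
HTML

# String#[] with (regex, group) returns the captured text, or nil on no match.
def extract_user_name(body)
  body[/class="user-name-class">([^<]*)</, 1]
end

# Skips six comma-terminated elements, then captures the quoted seventh one.
# Assumes none of the first six elements contains a comma or a "]".
def extract_media_id_by_regex(body)
  body[/\["media_detail","init",\[(?:[^,\]]*,){6}"([^"]*)"/, 1]
end

user_name = extract_user_name(html_body)
media_id  = extract_media_id_by_regex(html_body)

puts user_name
puts media_id

# When the pattern is absent the helpers return nil instead of raising.
puts extract_user_name('<html></html>').inspect
```

The caller still has to check for nil, but the check happens in one place instead of after every `split` and `[]`.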