Regex html help needed

Question

I have an HTML response body as a string. Two parts of that HTML content look like this:

<h2><a href="javascript:;" class="user-name-class">MY_USER_NAME<b></b></a></h2>

["media_detail","init",[false,"",null,true,1,4,"99999_XXXXX_99999",11836530,"00076f7474727febc37a8825d373a5be","\/p\/LdvJWSF-6b\/","\/accounts\/login\/"]],

From these I need to extract MY_USER_NAME and 99999_XXXXX_99999.

I would appreciate help from regex rockstars. This is in Ruby 1.9.3. Thanks.

UPDATE: We are using regex because this is not done in realtime, so performance is not a concern.

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Gus, Nov 02 '12 at 20:19
This is not a question of performance. Regular expressions are simply **unable to** parse HTML correctly. Not even speaking of invalid HTML that could be taken care of by a DOM parser. — Martin Ender, Nov 02 '12 at 20:54

score 3 · Answer 1 · edited May 23 '17 at 12:12


The first one is HTML, so you should parse it with an HTML parser, and the other is JSON, so you could use a JSON library. Don't use regex. It's evil.
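
A minimal sketch of the JSON half of this advice, assuming the fragment has already been isolated from the page as its own string (the variable names are illustrative, not from the question). It uses Ruby's standard `json` library. The HTML half would need a separate HTML parser such as the Nokogiri gem, which is not shown here.

```ruby
require 'json'

# The JSON-looking fragment from the question, including its trailing comma.
fragment = '["media_detail","init",[false,"",null,true,1,4,"99999_XXXXX_99999",11836530,"00076f7474727febc37a8825d373a5be","\/p\/LdvJWSF-6b\/","\/accounts\/login\/"]],'

# Drop the trailing comma so the text is a single valid JSON array.
data = JSON.parse(fragment.strip.chomp(','))

# data[2] is the nested argument array; the wanted value is its seventh item.
media_id = data[2][6]
puts media_id
```

Note that this still relies on the value staying at the same position in the array, which is an assumption about the page, not something the question confirms.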


answered Nov 02 '12 at 20:20

d33tah


score 0 · Answer 2 · answered Nov 02 '12 at 20:33


If you don't want to use HTML/JSON libraries, you can get the first one with:

str.gsub!(/<.*?>/, '')

To regex the second one you're going to have to tell us more about the format of the string... what's consistent, what isn't, etc.
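
A small sketch of how that substitution behaves, assuming `str` holds only the `<h2>` fragment from the question. Run against a whole page it would strip every tag and leave all of the page's text, not just the user name. The example uses the non-destructive `gsub`, because `gsub!` returns `nil` when nothing was replaced.

```ruby
# Only the <h2> fragment from the question, not the whole page.
fragment = '<h2><a href="javascript:;" class="user-name-class">MY_USER_NAME<b></b></a></h2>'

# Non-destructive gsub always returns a string with every tag removed.
name = fragment.gsub(/<.*?>/, '')
puts name

# gsub! returns nil when no substitution happened, so do not chain on it.
untouched = 'no tags here'.gsub!(/<.*?>/, '')
puts untouched.inspect
```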

answered Nov 02 '12 at 20:33

Philip Hallstrom


score 0 · Answer 3 · edited Nov 02 '12 at 21:29

You can use

s.split(/"user-name-class">/)[1].split(/</)[0]

(see this demo)

and

s.split(/\["media_detail"/)[1].split(/\[/)[1].split(/"?,"?/)[6]

(see this demo)
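
To make the two one-liners easier to follow, here is a sketch that runs them step by step on the sample strings from the question. The string `s` is assembled from those two fragments only, and the intermediate variable names are illustrative.

```ruby
html = '<h2><a href="javascript:;" class="user-name-class">MY_USER_NAME<b></b></a></h2>'
json = '["media_detail","init",[false,"",null,true,1,4,"99999_XXXXX_99999",11836530,"00076f7474727febc37a8825d373a5be","\/p\/LdvJWSF-6b\/","\/accounts\/login\/"]],'
s = html + "\n" + json

# Everything after the class attribute, cut off at the next "<".
user_name = s.split(/"user-name-class">/)[1].split(/</)[0]

# Text that follows the ["media_detail" marker.
after_marker = s.split(/\["media_detail"/)[1]
# Contents of the nested array that follows the next "[".
inner = after_marker.split(/\[/)[1]
# Split on commas, swallowing any quote directly before or after each comma.
fields = inner.split(/"?,"?/)
media_id = fields[6]

puts user_name
puts media_id
```

If either marker is missing from `s`, the first `split` returns a one-element array, `[1]` is `nil`, and the next call raises `NoMethodError`.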

edited Nov 02 '12 at 21:29

answered Nov 02 '12 at 20:42

Ωmega


So I have this big HTML body/string, how do I apply this regex to the whole html content? – kapso Nov 02 '12 at 21:10
@user310525 - Just put the entire html content to string `s`. Did you check demo links? – Ωmega Nov 02 '12 at 21:20
I get this error - undefined method `split' for nil:NilClass. This is what I tried - html_body.split(/"user-name-class">/)[1].split(/</)[0] – kapso Nov 02 '12 at 21:50
@user310525 - You got that error because the first `split(/"user-name-class">/)` found nothing to split on, so there is no `[1]` element (it is `nil`). That means your string `html_body` does not contain **"user-name-class">**. Print your `html_body` and you will see... – Ωmega Nov 02 '12 at 21:53
Sorry, false alarm, yeah, that works. Also, is there a way to avoid using the methods split and []? I was hoping to avoid nil errors in case the pattern changes. – kapso Nov 02 '12 at 22:44
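
One possible way to address that last comment, sketched here as an assumption and not as something posted in the thread: `String#[]` accepts a regex and a capture-group index and returns `nil` when nothing matches, so the caller can check for `nil` instead of hitting `NoMethodError`. The method names `extract_user_name` and `extract_media_id` are hypothetical, and the second pattern assumes the wanted value is always the seventh item and that the six items before it contain no commas.

```ruby
# Returns the text between the class attribute and the next "<", or nil.
def extract_user_name(html_body)
  html_body[/"user-name-class">([^<]*)</, 1]
end

# Skips six comma-terminated items of the nested array, then captures the
# quoted seventh item. Returns nil when the marker or layout is absent.
def extract_media_id(html_body)
  html_body[/\["media_detail","init",\[(?:[^,\]]*,){6}"([^"]*)"/, 1]
end

html_body = '<h2><a href="javascript:;" class="user-name-class">MY_USER_NAME<b></b></a></h2>' +
            "\n" +
            '["media_detail","init",[false,"",null,true,1,4,"99999_XXXXX_99999",11836530,"00076f7474727febc37a8825d373a5be","\/p\/LdvJWSF-6b\/","\/accounts\/login\/"]],'

puts extract_user_name(html_body)
puts extract_media_id(html_body)

# A page without the markers yields nil instead of raising.
puts extract_user_name('<p>nothing here</p>').inspect
```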
