Extracting data from attribute using regex

Question

I have the following pattern:

<tbody>
 <div id="aaa">Music</div>
 Ggfdlkjgfds f$5 j3k 
 <div title="Song title #1"></div>
 Fdjflkdsjfds
 <div title="Song title #2"></div>
</tbody>

And I have to extract "Song title #1" and "Song title #2" from this string.

So far I have written something like this:

(Music)(.*?)(title=\")(.*?)(\")(<\/tbody>)

But it doesn't work. How can I do that?

Thanks!

EDIT: This is not HTML, but part of the source code loaded from a Facebook user's page. There can be basically anything between those lines, so I'm looking only for three keywords:

Music
title="
</tbody>

And I want to find all matches after the middle one.

I don't think you need to escape the forward slash. Also you wrote tbody instead of div. — Benjy Kessler, May 07 '15 at 22:26
I have to do this for many thousands of requests; it would be too slow. It's from Facebook. — khernik, May 07 '15 at 22:27
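
A note on why the pattern in the question finds nothing. This is a minimal sketch, assuming `/` delimiters (the question shows none) and the question's sample text in a nowdoc; the variable names are only illustrative. Without the `s` modifier, `.` does not match newlines, and even with it the pattern demands `</tbody>` directly after a closing quote, which never occurs in the sample. Relaxing that still produces a single match, so only the first title is captured.

```php
<?php
$str = <<<'EOS'
<tbody>
 <div id="aaa">Music</div>
 Ggfdlkjgfds f$5 j3k
 <div title="Song title #1"></div>
 Fdjflkdsjfds
 <div title="Song title #2"></div>
</tbody>
EOS;

// The pattern from the question, wrapped in "/" delimiters (an assumption)
$plain = preg_match('/(Music)(.*?)(title=\")(.*?)(\")(<\/tbody>)/', $str);

// Same pattern with the "s" modifier so "." also matches newlines
$dotall = preg_match('/(Music)(.*?)(title=\")(.*?)(\")(<\/tbody>)/s', $str);

// Allowing arbitrary text between the closing quote and "</tbody>"
$relaxed = preg_match('/(Music)(.*?)(title=\")(.*?)(\")(.*?)(<\/tbody>)/s', $str, $m);

var_dump($plain, $dotall, $relaxed);
echo $m[4], "\n"; // the only title a single match can capture
```

Since one match can hold only one value per capture group, collecting all titles needs a second pass with `preg_match_all` over the matched section.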

mhall · Accepted Answer · 2015-05-07T23:02:59.790

Yet another answer :-P

Edit: Updated due to new info in question.

$str = <<<EOS
<tbody>
 <div id="aaa">Music</div>
 Ggfdlkjgfds f$5 j3k
 <div title="Song title #1"></div>
 Fdjflkdsjfds
 <div title="Song title #2"></div>
 Foobarbaz
 <div title="Song title #3"></div>
</tbody>
EOS;

// First, capture everything between "Music" and the next "</tbody>"
if (preg_match('#\bMusic\b(.*?)</tbody>#s', $str, $section)) {
    // Then collect every title="..." value inside that captured section
    preg_match_all('#title="(.*?)"#s', $section[1], $titles);
    print_r($titles[1]);
}

Output:

Array
(
    [0] => Song title #1
    [1] => Song title #2
    [2] => Song title #3
)
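
For many thousands of inputs, the two-step idea above could be wrapped in a small function. This is only a sketch: `extractTitlesBetween` and its parameters are hypothetical names, not part of the original answer, and it assumes attribute values are double-quoted and contain no embedded double quotes.

```php
<?php
/**
 * Hypothetical helper: return every title="..." value found between the
 * first occurrence of $startWord and the next occurrence of $endMarker.
 * Returns an empty array when that section is not present.
 */
function extractTitlesBetween(string $text, string $startWord, string $endMarker): array
{
    // Step 1: isolate the section; preg_quote() keeps both markers literal
    $section = '#\b' . preg_quote($startWord, '#') . '\b(.*?)'
             . preg_quote($endMarker, '#') . '#s';
    if (!preg_match($section, $text, $m)) {
        return [];
    }

    // Step 2: collect all title attribute values inside the section
    preg_match_all('#title="([^"]*)"#', $m[1], $titles);
    return $titles[1];
}

$sample = <<<'EOS'
<tbody>
 <div id="aaa">Music</div>
 Ggfdlkjgfds f$5 j3k
 <div title="Song title #1"></div>
 Fdjflkdsjfds
 <div title="Song title #2"></div>
</tbody>
EOS;

print_r(extractTitlesBetween($sample, 'Music', '</tbody>'));
```

Returning an empty array for a missing section lets the caller loop over the result without a separate check.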

score 0 · Answer 2 · edited May 23 '17 at 11:43


Don't use regular expressions to parse HTML; HTML is not a regular language. Use another tool, such as http://simplehtmldom.sourceforge.net/.

Useful post here on SO:

Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms

Jorick Spitzen · answered May 07 '15 at 22:34

Good thing he's not parsing HTML then; he just wants to rip a value out of a chunk of text. – castis May 07 '15 at 22:37
As an aside, a regex is probably not a good approach here, but not for theoretical reasons (read the comments under the question you linked carefully). The claim that HTML is not a regular language is a false argument. The main problem is that there is no real reason to use a text-based approach when you are looking at a structured language and the language in use (PHP) has built-in implementations of libxml. As for simplehtmldom, I think this library is useless, slow, and not so simple (I suggest you take a look at its code). – Casimir et Hippolyte May 08 '15 at 00:25
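
The libxml route mentioned in the comment above could look roughly like this with PHP's bundled `DOMDocument` and `DOMXPath`. It is a sketch under assumptions: the input is treated as an HTML fragment (the asker says the real data may not be), parser warnings are silenced because the fragment is not valid HTML, and the XPath expression takes every `div` with a `title` attribute that comes after the `div` whose text is "Music", without stopping at `</tbody>`.

```php
<?php
$html = <<<'EOS'
<tbody>
 <div id="aaa">Music</div>
 Ggfdlkjgfds f$5 j3k
 <div title="Song title #1"></div>
 Fdjflkdsjfds
 <div title="Song title #2"></div>
</tbody>
EOS;

// Collect parser warnings internally instead of emitting them
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// following:: selects nodes later in document order, wherever the
// parser ends up placing the divs in the repaired tree
$nodes = $xpath->query('//div[normalize-space(.)="Music"]/following::div[@title]');

$titles = [];
foreach ($nodes as $node) {
    $titles[] = $node->getAttribute('title');
}
print_r($titles);
```

Whether this is fast enough for thousands of requests would need measuring; the regex version avoids building a tree at all.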
