Get the content of the href attribute of an a element

Question

Possible Duplicate:
Grabbing the href attribute of an A element

Hello,

I have the following html I want to parse:

<td align="left" nowrap="nowrap"><a href="XXXXXXX">

I want to save XXXXXXX in a variable. I know next to nothing about regular expressions. I know how to do it using strpos, substr, etc., but I believe that is slower than using regex.

if (preg_match('!<td align="left" NOWRAP><a href=".\s+/.+">!', $result, $matches))
    echo $matches[1];
else
    echo "error!!!";

I know the previous code is an atrocity to a regex expert. But I really have no idea how to do it. I need some tips, not the full solution.

http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html — Diarmaid, Apr 29 '11 at 14:42
After testing DOM and regex, I've found that using strpos and substr is actually faster... — Cornwell, Apr 29 '11 at 15:59

score 3 · Answer 1 · edited May 23 '17 at 12:19


Here's my (not remotely original) tip: don't use regex to parse HTML. Use an HTML parser.

See How do you parse and process HTML/XML in PHP?.
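For illustration, here is a minimal sketch of the parser approach using PHP's bundled DOM extension. The sample URL and the variable names (`$html`, `$hrefs`) are assumptions made up for the example, not something from the question.

```php
<?php
// Hypothetical input: the fragment from the question with a sample URL filled in.
$html = '<table><tr><td align="left" nowrap="nowrap">'
      . '<a href="http://example.com/page">link</a></td></tr></table>';

// Collect libxml warnings internally instead of printing them,
// since real-world markup is rarely perfectly valid.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Gather the href of every <a> element that has one.
$hrefs = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    if ($a->hasAttribute('href')) {
        $hrefs[] = $a->getAttribute('href');
    }
}

echo $hrefs[0] ?? 'error!!!', "\n";
```

This does not care about attribute order, quoting style, or whitespace between the tags, which is where a hand-written pattern usually breaks.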

answered Apr 29 '11 at 14:40

Matt Ball

+1 for good advice 2 seconds earlier :D – alex Apr 29 '11 at 14:40

score 2 · Answer 2 · answered Apr 29 '11 at 14:40

Part of knowing regex is knowing when not to use it.

When you want to parse HTML, nine times out of ten regex is not the right tool.

You can use a DOM parser.
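If only the link inside that particular cell is wanted, an XPath query on top of the DOM can narrow it down. This is a rough sketch: the two-cell sample markup, the `/wanted` and `/other` URLs, and the variable names are invented for the example.

```php
<?php
// Hypothetical input: two cells, only the second has the attributes from the question.
$html = '<table><tr>'
      . '<td><a href="/other">other</a></td>'
      . '<td align="left" nowrap="nowrap"><a href="/wanted">wanted</a></td>'
      . '</tr></table>';

// Keep libxml warnings quiet for imperfect markup.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select the href attribute of <a> elements that are direct children
// of a <td> with align="left" and a nowrap attribute.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//td[@align="left"][@nowrap]/a/@href');

$href = $nodes->length > 0 ? $nodes->item(0)->nodeValue : null;

echo $href ?? 'error!!!', "\n";
```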

alex

score 1 · Answer 3 · answered Apr 29 '11 at 14:46

If your structure is always the same as what you posted, you can use this regex:

<td\s+align="left"\s+nowrap="nowrap">\s*<a\s+href="(.*?)">

and then take group #1, which is the string matched between the parentheses. You have to make a group, that is, a region between parentheses that contains the data you want to extract. This link contains useful information about regex and the PHP implementation.

Alberto

Thank you, but I get this error: "Unknown modifier '\' " – Cornwell Apr 29 '11 at 14:51
Try to escape the slash, like '\\' – Alberto Apr 29 '11 at 20:14
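A note on the "Unknown modifier" error: PHP's preg_* functions expect the pattern to be wrapped in delimiters. When the pattern is passed without them, the leading `<` is most likely taken as an opening bracket-style delimiter and the first `>` as its close, so the `\` of the following `\s*` gets read as a modifier. Below is a minimal sketch of the pattern with delimiters and a capture group. The `~` delimiter, the `[^"]*` group, the `i` flag, and the sample URL are choices made for this example, not part of the answer above.

```php
<?php
// Hypothetical input: the line from the question with a sample URL in place of XXXXXXX.
$result = '<td align="left" nowrap="nowrap"><a href="http://example.com/page">';

// '~' is the delimiter, so '<', '>' and '/' inside the pattern need no escaping.
// ([^"]*) is group 1: everything up to, but not including, the closing quote.
// The trailing 'i' makes the match case-insensitive (e.g. NOWRAP).
$pattern = '~<td\s+align="left"\s+nowrap="nowrap">\s*<a\s+href="([^"]*)">~i';

if (preg_match($pattern, $result, $matches)) {
    $href = $matches[1];   // $matches[0] is the whole match, [1] is the first group
} else {
    $href = null;
}

echo $href ?? 'error!!!', "\n";
```

This only works while the markup stays exactly in this shape; any change in attribute order or quoting will make the match fail.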
