PHP regex for parsing HTML

Question

I need some help parsing HTML: I want to extract everything that starts with http://, contains "abc", and runs until the first occurrence of ", ', or a blank space.

I have a regex like this: /http:\/\/abc(.*)\"/ but it's not working well :\

Are there any ideas? :)

P.S. Sorry for my bad English, it's not my native language ;)

No but seriously, give us some sample data that you're trying to parse. And explain what you mean by "not working well". — Joshua Evensen, Dec 22 '10 at 19:05
PLEASE stop posting links to that comment. It is far too clever for its own good, such that the people who get it are the people who already get it, and the people who need to know don't understand it. — Andy Lester, Dec 22 '10 at 20:38

score 5 · Accepted Answer · answered Dec 22 '10 at 19:07

Stack Overflow tends to prefer an HTML document parser over regular expressions for parsing HTML.

However, with that said, if you just want URLs from a string that happens to be HTML, I still believe a Regex is fine for the job.

preg_match_all("/http:\/\/[^\s'\"]*abc[^\s'\"]*/", $string, $matches);
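
As a minimal sketch of how that call behaves, assuming a made-up HTML snippet (the sample markup and the `$html` variable are illustrative, not from the question). The pattern in the question requires `abc` directly after `http://`, and its greedy `.*` runs to the last double quote on the line; the character classes here stop at the first quote or whitespace instead.

```php
<?php
// Hypothetical sample input; the question did not include real data.
$html = <<<'HTML'
<a href="http://example.com/abc/page.html">one</a>
<a href='http://abc.example.org/x'>two</a>
<a href="http://example.net/other">three</a>
<p>plain text http://example.com/path?abc=1 trailing words</p>
HTML;

// Same pattern as above: "http://", then a run of characters that are
// neither whitespace nor quotes, which must contain "abc" somewhere.
preg_match_all("/http:\/\/[^\s'\"]*abc[^\s'\"]*/", $html, $matches);

// $matches[0] holds the full matches, in document order.
print_r($matches[0]);
```

With this sample, `$matches[0]` contains the first, second, and fourth URLs; the third has no "abc" and is skipped. Note that `<` is not excluded, so a URL in running text that is immediately followed by a tag would drag the tag along.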

Jason McCreary

score 3 · It's not Stack Overflow that prefers parsing HTML with DOM, it's HTML itself that prefers it over regular expression. ;) – netcoder Dec 22 '10 at 19:16
score 3 · @netcoder, Fair, but this community typically screams HTML for these types of questions. And while I respect your viewpoint, something such as parsing out URLs is perfectly valid to do with a regex. – Jason McCreary Dec 23 '10 at 01:38

score 1 · Answer 2 · edited May 23 '17 at 12:14

Use a parser instead of a regex.
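
A rough sketch of what that can look like with PHP's bundled DOM extension, assuming the URLs of interest sit in the `href` attributes of `<a>` elements (the question does not say where they appear) and using a hypothetical `$html` sample:

```php
<?php
// Hypothetical sample input; the question did not include real data.
$html = '<p><a href="http://example.com/abc?x=1&amp;y=2">one</a> '
      . "<a href='http://example.net/other'>two</a></p>";

$doc = new DOMDocument();
// Real-world HTML is often invalid; collect libxml warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$abcUrls = array();
foreach ($doc->getElementsByTagName('a') as $link) {
    $href = $link->getAttribute('href');
    // Keep http:// URLs that contain "abc" anywhere after the scheme.
    if (strpos($href, 'http://') === 0 && strpos($href, 'abc', 7) !== false) {
        $abcUrls[] = $href;
    }
}

print_r($abcUrls);
```

Because the parser decodes attribute values, the `&amp;` in the markup comes back as a plain `&` in the extracted URL.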

Community

answered Dec 22 '10 at 19:03

Mark Baijens

Nathan · Answer 3 · 2010-12-22T19:22:10.113

0

If all you want to do is extract URLs, regexen are a good choice. You don't need to get into the parser world.

If you have Unix-like command-line tools, you could approximate it very simply (assuming one URL per line) with two passes:

grep http myfile.html | grep abc

You can use preg_grep() similarly.

preg_match_all('/http:[^"\' ]+/', $html, $urls);
# $urls[0] contains all the URLs from your document (the full matches)
$abc_urls = preg_grep('/abc/', $urls[0]);
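
A small end-to-end sketch of the two-step approach, assuming a hypothetical `$html` sample. `preg_match_all()` fills its third argument with an array of arrays, so the full matches live in `$urls[0]`. Because the regex sees raw markup, entities such as `&amp;` stay encoded unless you decode them, here with `html_entity_decode()`.

```php
<?php
// Hypothetical sample input; the question did not include real data.
$html = '<a href="http://example.com/abc?x=1&amp;y=2">one</a> '
      . '<a href="http://example.net/other">two</a>';

// Step 1: grab everything that starts with "http:" up to a quote or space.
preg_match_all('/http:[^"\' ]+/', $html, $urls);

// Step 2: keep only the matches that contain "abc".
$abc_urls = preg_grep('/abc/', $urls[0]);

// Optional step 3: turn HTML entities (&amp; etc.) back into plain characters.
$decoded = array_map('html_entity_decode', $abc_urls);

print_r($decoded);
```

`preg_grep()` preserves the original array keys, so wrap the result in `array_values()` if you need a zero-based list.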

edited Dec 22 '10 at 19:22

answered Dec 22 '10 at 19:15

Nathan

Oh dear. This URL has a query string. Therefore it includes `&`. Use a real parser. – Quentin Dec 22 '10 at 19:20
That regex would be fine with `&`, just no spaces or quotes. My point with the grep example is that there are practical alternatives to a real parser, depending on what you're trying to do. – Nathan Dec 22 '10 at 19:27
It wouldn't pull the URL out though, it would pull out an HTML encoded URL. Parsers have solutions for edge cases built in. – Quentin Dec 23 '10 at 11:58
