0

I need some help parsing HTML: extracting everything that starts with http://, contains "abc", and runs until the first occurrence of " or ' or a blank space.

I have a regex like this, /http:\/\/abc(.*)\"/, but it's not working well :\

Any ideas? :)

P.S. Sorry for my bad English, it's not my native language ;)

guest86
  • No but seriously, give us some sample data that you're trying to parse. And explain what you mean by "not working well". – Joshua Evensen Dec 22 '10 at 19:05
  • 1
    @Joshua: No but seriously, OP should use a HTML parser. :) – netcoder Dec 22 '10 at 19:14
  • 4
    PLEASE stop posting links to that comment. It is far too clever for its own good, such that the people who get it are the people who already get it, and the people who need to know don't understand it. – Andy Lester Dec 22 '10 at 20:38

3 Answers

5

Stack Overflow tends to prefer an HTML document parser over regular expressions for parsing HTML.

However, with that said, if you just want URLs from a string that happens to be HTML, I still believe a Regex is fine for the job.

Try preg_match_all:

preg_match_all("/http:\/\/[^\s'\"]*abc[^\s'\"]*/", $string, $matches);
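As a rough illustration of how that call behaves, here is a minimal sketch run against a made-up input. The `$string` contents are hypothetical sample data, not something from the question; the pattern is the one shown above.

```php
<?php
// Hypothetical sample input; not taken from the question.
$string = '<a href="http://example.com/abc/page">one</a> '
        . "<img src='http://abc.example.org/pic.png'> "
        . '<a href="http://example.com/other">two</a>';

// http://, then a run of characters that are not whitespace or quotes,
// with "abc" somewhere inside that run.
preg_match_all("/http:\/\/[^\s'\"]*abc[^\s'\"]*/", $string, $matches);

// $matches[0] holds the full matches; the URL without "abc" is skipped.
print_r($matches[0]);
```

Because the character class excludes quotes and whitespace, a match cannot run past the end of an attribute value, which is the problem a greedy `(.*)` runs into.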
Jason McCreary
  • 71,546
  • 23
  • 135
  • 174
  • 3
    It's not Stack Overflow that prefers parsing HTML with DOM, it's HTML itself that prefers it over regular expression. ;) – netcoder Dec 22 '10 at 19:16
  • 3
    @netcoder, Fair, but this community typically screams HTML for these types of questions. And while I respect your viewpoint, something such as parsing out URLs is perfectly valid to do with a regex. – Jason McCreary Dec 23 '10 at 01:38
1

Use a parser instead of a regex.

RegEx match open tags except XHTML self-contained tags
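For what the parser route might look like in PHP, here is a minimal sketch using the built-in `DOMDocument` and `DOMXPath` classes. The sample `$html` is hypothetical, and the sketch assumes the URLs of interest live in `href` or `src` attributes; URLs that appear in plain text would not be found this way.

```php
<?php
// Hypothetical sample input; not taken from the question.
$html = '<p><a href="http://example.com/abc?x=1&amp;y=2">one</a>'
      . '<a href="http://example.com/other">two</a>'
      . '<img src="http://abc.example.org/pic.png"></p>';

$doc = new DOMDocument();
// Collect libxml warnings about imperfect markup instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$abcUrls = [];
// Look at every href and src attribute in the document.
foreach ($xpath->query('//@href | //@src') as $attr) {
    $url = $attr->value; // already entity-decoded by the parser
    if (strpos($url, 'http://') === 0 && strpos($url, 'abc') !== false) {
        $abcUrls[] = $url;
    }
}

print_r($abcUrls);
```

The parser hands back attribute values with entities such as `&amp;` already decoded, so the query string comes out as a usable URL.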

Mark Baijens
  • 13,028
  • 11
  • 47
  • 73
0

If all you want to do is extract URLs, regexen are a good choice. You don't need to get into the parser world.

If you have Unix-like command-line tools, you could approximate it very simply (assuming one URL per line) with two passes:

grep http myfile.html | grep abc

You can use preg_grep() similarly.

preg_match_all('/http:[^"\' ]+/', $html, $urls);
# $urls[0] contains all the full-match URLs from your document
# ($urls itself is an array of arrays, so pass $urls[0] to preg_grep)
$abc_urls = preg_grep('/abc/', $urls[0]);
Nathan
  • Oh dear. This URL has a query string. Therefore it includes `&`. Use a real parser. – Quentin Dec 22 '10 at 19:20
  • That regex would be fine with `&`, just no spaces or quotes. My point with the grep example is that there are practical alternatives to a real parser, depending on what you're trying to do. – Nathan Dec 22 '10 at 19:27
  • It wouldn't pull the URL out though, it would pull out an HTML encoded URL. Parsers have solutions for edge cases built in. – Quentin Dec 23 '10 at 11:58
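The point raised in the comments above about HTML-encoded URLs can be seen in a small sketch. The `$html` value is a hypothetical example; `html_entity_decode()` is used here as one possible way to post-process a regex match, not as something the answer itself proposed.

```php
<?php
// Hypothetical HTML source; inside an attribute, "&" is written as "&amp;".
$html = '<a href="http://example.com/abc?x=1&amp;y=2">link</a>';

preg_match_all('/http:[^"\' ]+/', $html, $urls);

$raw = $urls[0][0];                  // as it appears in the markup, still encoded
$decoded = html_entity_decode($raw); // entities turned back into characters

echo $raw, "\n", $decoded, "\n";
```

The regex match keeps the `&amp;` from the markup, so a decoding step like this is needed before the URL is requested, whereas a parser does that decoding as part of reading the attribute.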