
This is how far I got. This is working:

$urls = $this->match_all('/<a href="(http:\/\/www.imdb.de\/title\/tt.*?)".*?>.*?<\/a>/ms',
            $content, 1);

Now I want to do the same with a different site, but that site's links have a different structure: http://www.example.org/ANYTHING

I don't know what I am doing wrong, but it is not working with this other site (example.org).

Here is what I have tried:

$urls = $this->match_all('/<a href="(http:\/\/www.example.org\/.*?)".*?>.*?<\/a>/ms',
    $content, 1);

Thank you for your help. Stackoverflow is so awesome!

Helena
  • It should match anything after this .org/ . As I said http://www.example.org/ANYTHING – Helena Jan 06 '12 at 00:17
  • The delimiter character should be a character that does not occur, or rarely occurs, within the regex. So using `/` as a delimiter for URL-related regexes is a bad choice - use `#` or something, it will make your regex a lot more readable and debuggable. – DaveRandom Jan 06 '12 at 00:22
  • Since you are asking regex question after regex question, I think it might be time to enlighten you on [some tools that aid in constructing them](http://stackoverflow.com/questions/89718/is-there-anything-like-regexbuddy-in-the-open-source-world) -or- [online tools](http://stackoverflow.com/questions/2491930/is-there-an-online-regexbuddy-like-regular-expression-analyzer) and http://regular-expressions.info/ for an introduction. Be sure to read up on when to use [DOM vs. regex](http://stackoverflow.com/a/3650431/345031) (depends on proficiency, and if the html is always normalized+well-known). – mario Jan 06 '12 at 00:26
  • you may want to parse the document using DOMDocument first to get the anchor tags. your above regex fails to find many possible anchor tags such as ``. it assumes the tag has no attributes before the href attribute. – dqhendricks Jan 06 '12 at 00:56
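
The DOMDocument suggestion in the last comment could look roughly like the sketch below. The sample markup is made up for illustration (including the `class` attribute placed before `href`, which the regex in the question would miss), and the prefix check with `strpos` is just one assumed way to filter the links.

```php
<?php
// Hypothetical sample markup; the first link has an attribute before href.
$content = '<p><a class="ext" href="http://www.example.org/foo">Foo</a> '
         . '<a href="http://other.example.com/bar">Bar</a></p>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them;
// real-world HTML is often not well-formed.
libxml_use_internal_errors(true);
$doc->loadHTML($content);
libxml_clear_errors();

$urls = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    // Keep only links that start with the prefix from the question.
    if (strpos($href, 'http://www.example.org/') === 0) {
        $urls[] = $href;
    }
}

print_r($urls);
```

This needs PHP's DOM extension, and it does not depend on the order of attributes inside the tag.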

3 Answers


ANYTHING is usually represented by .*? (which you already use in your original regex). You could also use [^"]+ as placeholder in your case.
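
As a rough sketch of the `[^"]+` variant: the snippet below assumes that the `match_all` helper from the question wraps `preg_match_all`, which the question does not confirm, and the sample markup is invented for illustration.

```php
<?php
// Hypothetical sample markup; the URLs are placeholders.
$content = '<p><a href="http://www.example.org/foo/bar" class="x">Foo</a> '
         . '<a href="http://www.example.org/baz">Baz</a></p>';

// [^"]+ matches one or more characters that are not a double quote,
// so the capture cannot run past the end of the href value.
preg_match_all('#<a href="(http://www\.example\.org/[^"]+)"#', $content, $matches);

// Group 1 holds the captured URLs, one entry per matched link.
$urls = $matches[1];

print_r($urls);
```

Because the character class stops at the closing quote on its own, the pattern does not need to match the rest of the tag at all.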

mario

It sounds like you want the following regular expression:

'/<a href="(http:\/\/example\.org\/.*?)".*?>.*?<\/a>/ms'

You can also use a different delimiter to avoid escaping the forward slashes:

'#<a href="(http://example\.org/.*?)".*?>.*?</a>#ms'

Note the escaping of the . in the domain name, as you intend to match a literal ., not any character.
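
To see what the escaped dot changes, here is a small sketch. The two input strings are made up, and the patterns include the `www.` prefix from the question's URL rather than the bare domain used above.

```php
<?php
// Hypothetical inputs for illustration only.
$good = '<a href="http://www.example.org/page">ok</a>';
$bad  = '<a href="http://wwwXexample.org/page">different host</a>';

// Unescaped dots: "." matches any character, so the wrong host slips through.
$loose  = '#<a href="(http://www.example.org/.*?)".*?>.*?</a>#ms';
// Escaped dots: only a literal "." is accepted.
$strict = '#<a href="(http://www\.example\.org/.*?)".*?>.*?</a>#ms';

$looseMatchesBad   = preg_match($loose, $bad);        // 1: matches the wrong host
$strictMatchesBad  = preg_match($strict, $bad);       // 0: rejected
$strictMatchesGood = preg_match($strict, $good, $m);  // 1: group 1 holds the URL

var_dump($looseMatchesBad, $strictMatchesBad, $strictMatchesGood, $m[1]);
```

With `#` as the delimiter, none of the slashes in the URL or in `</a>` need a backslash.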

cmbuckley

I think this should help:

/<a href="(http:\/\/www.example.org\/.*?)".*?>.*?<\/a>/ms
<a href="http://www.example.org/ANYTHING">text</a>

Result:

Array
(
    [0] => <a href="http://www.example.org/ANYTHING">text</a>
    [1] => http://www.example.org/ANYTHING
)

EDIT: I always find this site very useful when I want to try out preg_match - http://www.solmetra.com/scripts/regex/index.php

rroche