Regular Expression (preg_match) match anything

Question

This is how far I got. This is working:

$urls = $this->match_all('/<a href="(http:\/\/www.imdb.de\/title\/tt.*?)".*?>.*?<\/a>/ms',
            $content, 1);

Now I want to do the same with a different site. But the links on that site have a different structure: http://www.example.org/ANYTHING

I don't know what I am doing wrong but with this other site (example.org) it is not working.

Here is what I have tried

$urls = $this->match_all('/<a href="(http:\/\/www.example.org\/.*?)".*?>.*?<\/a>/ms',
    $content, 1);

Thank you for your help. Stackoverflow is so awesome!

It should match anything after this .org/ . As I said http://www.example.org/ANYTHING — Helena, Jan 06 '12 at 00:17
The delimiter character should be a character that does not occur, or only rarely occurs, within the regex. So using `/` as a delimiter for URL-related regexes is a bad choice - use `#` or something, it will make your regex a lot more readable and debuggable. — DaveRandom, Jan 06 '12 at 00:22
Since you are asking regex question after regex question, I think it might be time to enlighten you on [some tools that aid in constructing them](http://stackoverflow.com/questions/89718/is-there-anything-like-regexbuddy-in-the-open-source-world) -or- [online tools](http://stackoverflow.com/questions/2491930/is-there-an-online-regexbuddy-like-regular-expression-analyzer) and http://regular-expressions.info/ for an introduction. Be sure to read up on when to use [DOM vs. regex](http://stackoverflow.com/a/3650431/345031) (depends on proficiency, and if the html is always normalized+well-known). — mario, Jan 06 '12 at 00:26
You may want to parse the document using DOMDocument first to get the anchor tags. Your regex above fails to find many possible anchor tags, because it assumes the tag has no attributes before the href attribute. — dqhendricks, Jan 06 '12 at 00:56
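
The DOMDocument suggestion in the last comment could be sketched roughly as follows. This is only an illustration under stated assumptions: the `$content` markup is made up (it is not from the question), and the prefix check with `strpos` is one possible way to filter the links, not something the commenters prescribed.

```php
<?php
// Hypothetical sample markup; note the class attribute placed before href,
// which the regex from the question would not match.
$content = '<p><a class="ext" href="http://www.example.org/ANYTHING">text</a>'
         . '<a href="http://other.example.com/">other</a></p>';

$doc = new DOMDocument();
// Collect libxml warnings internally instead of emitting them;
// real-world HTML is rarely well-formed.
libxml_use_internal_errors(true);
$doc->loadHTML($content);
libxml_clear_errors();

$urls = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    // Keep only links that start with the site prefix from the question.
    if (strpos($href, 'http://www.example.org/') === 0) {
        $urls[] = $href;
    }
}

print_r($urls);
```

Because the parser handles attribute order and quoting, the filter only has to look at the href value itself.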

score 3 · Answer 1 · answered Jan 06 '12 at 00:19

ANYTHING is usually represented by .*? (which you already use in your original regex). You could also use [^"]+ as placeholder in your case.
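
To make the two placeholders concrete, here is a minimal sketch that runs both against a made-up snippet. The `$html` string and the variable names are assumptions for illustration only; `#` is used as the delimiter so the slashes need no escaping.

```php
<?php
// Hypothetical sample input; not taken from the question.
$html = '<a href="http://www.example.org/some/page?id=1" class="x">text</a>';

// Lazy "anything": .*? grows one character at a time until the next quote.
preg_match_all('#<a href="(http://www\.example\.org/.*?)"#', $html, $lazy);

// Negated class: one or more characters that are not a double quote.
preg_match_all('#<a href="(http://www\.example\.org/[^"]+)"#', $html, $negated);

print_r($lazy[1]);
print_r($negated[1]);
```

Both patterns capture the same URL for this input. The practical difference is that `[^"]+` requires at least one character after the slash and can never run past the closing quote, while `.*?` also accepts an empty remainder.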

mario

I have edited my question. It would be kind if you could take a look at it again. Thanks. – Helena Jan 06 '12 at 00:20
Well, then that's not your problem. Be sure to check out the related links. – mario Jan 06 '12 at 00:21

score 0 · Answer 2 · answered Jan 06 '12 at 00:20

It sounds like you want the following regular expression:

'/<a href="(http:\/\/example\.org\/.*?)".*?>.*?<\/a>/ms'

You can also use a different delimiter to avoid escaping the forward slashes:

'#<a href="(http://example\.org/.*?)".*?>.*?</a>#ms'

Note the escaping of the . in the domain name, as you intend to match a literal ., not any character.
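
A small sketch of why the escaped dot matters. The two input strings are hypothetical and chosen only to show the difference; the patterns are the `#`-delimited form from this answer, once with and once without the backslash.

```php
<?php
// Hypothetical inputs: the second host only resembles the first.
$real = '<a href="http://example.org/page">text</a>';
$fake = '<a href="http://exampleXorg/page">text</a>';

$unescaped = '#<a href="(http://example.org/.*?)".*?>.*?</a>#ms';
$escaped   = '#<a href="(http://example\.org/.*?)".*?>.*?</a>#ms';

// preg_match returns 1 on a match and 0 on no match.
$looseOnFake  = preg_match($unescaped, $fake); // unescaped dot accepts the "X"
$strictOnFake = preg_match($escaped, $fake);   // literal dot required, no match
$strictOnReal = preg_match($escaped, $real);   // the real host still matches

var_dump($looseOnFake, $strictOnFake, $strictOnReal);
```

Inside single quotes PHP leaves `\.` untouched, so the backslash reaches the regex engine as written.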

score 0 · Answer 3 · answered Jan 06 '12 at 00:20

I think this should help

/<a href="(http:\/\/www.example.org\/.*?)".*?>.*?<\/a>/ms
<a href="http://www.example.org/ANYTHING">text</a>

Result:

Array
(
    [0] => <a href="http://www.example.org/ANYTHING">text</a>
    [1] => http://www.example.org/ANYTHING
)

EDIT: I always find this site very useful when I want to try out preg_match - http://www.solmetra.com/scripts/regex/index.php
