1

I'm trying to follow a tutorial on web scraping with PHP.

I understand roughly what's going on, but I don't get how to filter what has been scraped to get exactly what I want. For example:

<?php
$file_string = file_get_contents('page_to_scrape.html');
preg_match('/<title>(.*)<\/title>/i', $file_string, $title);
$title_out = $title[1];
?>

I see that the (.*) will retrieve everything between the title tags. Can I use regular expressions to get more specific information? Say the title contained Welcome visitor #100: how would I get the number that comes after the hash?

Or do I have to retrieve everything between the tags and then manipulate it later?

Toto
mao

3 Answers

3

Given the title "Welcome visitor #100" and the fact that a <title> tag occurs no more than once, the expression should be:

preg_match('~<title>Welcome visitor #(\d+)</title>~', ...);

A lot of people on SO would argue to never use regular expressions to parse (X)HTML; for this task, however, the above should suffice.
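
For reference, a complete call might look like the sketch below. The sample string $html and the variable names are assumptions made for illustration; only the pattern comes from the answer above.

```php
<?php
// Hypothetical sample input standing in for the scraped page.
$html = '<html><head><title>Welcome visitor #100</title></head></html>';

// ~ is used as the delimiter, so the slash in </title> needs no escaping.
if (preg_match('~<title>Welcome visitor #(\d+)</title>~', $html, $matches)) {
    // $matches[0] is the whole match, $matches[1] the first capture group.
    $visitor_number = (int) $matches[1];
} else {
    // Title missing or not in the expected format.
    $visitor_number = null;
}

var_dump($visitor_number); // int(100)
```

Checking the return value of preg_match before reading $matches[1] avoids an undefined-index notice when the title does not have the expected shape.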

Although - as mentioned before - a <title> tag (should) occur no more than once, the pattern

<title>(.*)</title>

would also match this:

<title>Welcome visitor <title>#<title>100blafoobar</title>

The (.*) is the part that allows this, because it accepts any characters at all, including further tags. Also keep in mind that as soon as the page you're scraping changes, the regex might stop working.
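
A small sketch of that difference, run against the malformed title shown above. The variable names are chosen here purely for illustration.

```php
<?php
// The malformed title from the example above.
$html = '<title>Welcome visitor <title>#<title>100blafoobar</title>';

// Permissive pattern: (.*) accepts anything, including nested <title> tags.
preg_match('~<title>(.*)</title>~', $html, $loose);

// Strict pattern: only the exact expected wording matches.
$strict_found = preg_match('~<title>Welcome visitor #(\d+)</title>~', $html, $strict);

echo $loose[1], PHP_EOL;  // Welcome visitor <title>#<title>100blafoobar
var_dump($strict_found);  // int(0)
```

The permissive pattern happily returns garbage, while the strict one reports no match, which is usually the easier failure to detect.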


EDIT: A method to correctly sift out multiple elements and their attributes:

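$dom = new DOMDocument;
// $page_content holds the HTML of the scraped page
$dom->loadHTML($page_content);

// Every <a> element in the document
$elements = $dom->getElementsByTagName('a');

for ($n = 0; $n < $elements->length; $n++) {
    $item = $elements->item($n);
    // Read an attribute of the current element
    $href = $item->getAttribute('href');
}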
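
A self-contained variant of the same idea is sketched below. It assumes a made-up sample page in $page_content, collects every href into an array, and restricts a lookup to a parent element by calling getElementsByTagName on that element rather than on the whole document.

```php
<?php
// Hypothetical sample page; in practice this would be the scraped HTML.
$page_content = '<html><head><title>Welcome visitor #100</title></head>'
    . '<body><a href="/home">Home</a><a href="/about">About</a></body></html>';

$dom = new DOMDocument;
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($page_content);
libxml_clear_errors();

// Gather every href into an array instead of overwriting one variable.
$hrefs = [];
foreach ($dom->getElementsByTagName('a') as $link) {
    $hrefs[] = $link->getAttribute('href');
}

// Scope the search to a parent element: only <title> tags inside <head>.
$head = $dom->getElementsByTagName('head')->item(0);
$title = $head->getElementsByTagName('title')->item(0)->textContent;

print_r($hrefs);
echo $title, PHP_EOL; // Welcome visitor #100
```

Real-world pages are often malformed, which is why the sketch switches libxml to internal error handling before parsing.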
Community
Linus Kleen
  • OK, thanks for explaining. Any tips on how to approach tags that occur multiple times? Can I access them using their parents in the DOM? – mao Feb 23 '12 at 00:25
2

You would just need to change the regex to match whatever you need. If you are going to use the title more than once, it's better to save the whole title and manipulate it later; otherwise just capture what you need.

/<title>.*((?<=#)\d+).*<\/title>/i

Would specifically match a number after a hash. It would not match a number without a hash.

There are many ways to write a regex; it depends on how general or specific you want to be.

You could also write it like this to capture the first number in the title, with or without a hash:

/<title>.*?(\d+).*<\/title>/i
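
To make the choice between first and last occurrence concrete, here is a rough sketch that works on a title string that has already been extracted. The sample title and the variable names are assumptions for illustration.

```php
<?php
// Hypothetical title containing more than one hash.
$title = 'Ticket #7: welcome visitor #100';

// Lazy .*? stops at the first "#<digits>".
preg_match('/^.*?#(\d+)/', $title, $first);

// Greedy .* runs to the end and backtracks, so it finds the last "#<digits>".
preg_match('/^.*#(\d+)/', $title, $last);

// preg_match_all collects every occurrence.
preg_match_all('/#(\d+)/', $title, $all);

echo $first[1], PHP_EOL; // 7
echo $last[1], PHP_EOL;  // 100
print_r($all[1]);        // the captures '7' and '100'
```

Replacing the # in these patterns with another character or a space applies the same first-versus-last logic to other separators.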

Kassym Dorsel
  • Thank you. What if there were more than one hash? Is there any way to specify the last hash? – mao Feb 23 '12 at 00:26
  • The first will only match a hash followed by numbers. If you have a hash followed by anything but numbers, it won't make a difference. – Kassym Dorsel Feb 23 '12 at 00:36
  • Thanks, but I'm asking in case it's not a hash but a space or some other common character, and I want to find the numbers after its last instance. I think I'd better go read about using regular expressions. – mao Feb 23 '12 at 00:47
0

I would first fetch the title tag and then process the title further. The other answers contain perfectly valid solutions for this task.

Some further notes:

  • Please use DOMDocument for such things, since it is much safer (your regular expression might break on some specific HTML pages)
  • Please use the non-greedy version of .*: .*?, otherwise you will run into funny things like:

    <html>
        <head>
            <title>a</title>
        </head>
        <body>
            <title>test</title> <!-- not allowed in HTML, but since when do web pages actually care about that? -->
        </body>
    </html>
    

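With the greedy .* (and the s modifier set, so that . also matches newlines), the match runs from the first <title> to the last </title>: the captured group is a</title> ... <title>test, including everything in between.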
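
The effect can be reproduced with the sketch below, which uses a condensed copy of the page above. The s modifier is added here as an assumption so that . also matches the line break; without it, . stops at the end of each line.

```php
<?php
// Condensed version of the page above, with the invalid second <title>.
$html = "<html><head><title>a</title></head>\n"
      . "<body><title>test</title></body></html>";

// Greedy: runs from the first <title> to the last </title>.
preg_match('/<title>(.*)<\/title>/is', $html, $greedy);

// Non-greedy: stops at the first closing tag.
preg_match('/<title>(.*?)<\/title>/is', $html, $lazy);

var_dump($greedy[1]); // contains "a</title></head>", a newline, then "<body><title>test"
var_dump($lazy[1]);   // string(1) "a"
```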

apfelbox