Get all strings between two other strings in html document in PHP

Question

I'm building a kind of crawler/proxy at the moment. It lets users navigate another website while remaining on my site. While a page loads, I would also like to extract all of its links and data at the same time.

The website contains many <tr> elements, each of which contains a lot of other markup.

Here is one example of the many on the website:

<tr>
    <td class="vertTh">
        <center>
            <a href="/s/browse/other.php">Other</a>
            <br>
            <a href="/s/browse/documents.php">Document</a>
        </center>
    </td>
    <td>
        <div class="Name">
            <a href="/s/database/Document_Title_Info" class="Link">Document Title Info</a>
        </div>
        <a href="http://example.com/source/to/document/which%20can%20be%20very%20long%20and%20have%20weird%20characters" title="Source">
            <img src="/static/img/icon-source.png" alt="Source">
        </a>
        <font class="Desc">Uploaded 03-24&nbsp;14:02, Size 267.35&nbsp;KB, ULed by <a class="Desc" href="/s/user/username/" title="Browse username">username</a></font>
    </td>
    <td align="right">67</td>
    <td align="right">9</td>
</tr>

Users browse the proxy site, and while they do, it collects information from the original website. I figured out how to get a string between two other strings, but I don't know how to turn this into a foreach loop or something similar.

So let's say I want to get the source link. Then I would do something like this:

$url = $_GET['url'];
$str = file_get_contents('https://database.com/' . $url);

$source = 'http://example.com/source/to/' . getStringBetween($str,'example.com/source/to/','" title="Source">'); // Output looking like this: http://example.com/source/to/document/which%20can%20be%20very%20long%20and%20have%20weird%20characters

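function getStringBetween($str, $from, $to)
{
    // Locate the end of $from, then the first $to after it.
    $start = strpos($str, $from);
    if ($start === false) return '';
    $start += strlen($from);
    $end = strpos($str, $to, $start);
    return $end === false ? '' : substr($str, $start, $end - $start);
}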

But this only finds the first occurrence, and the page contains many of these strings. Is there a way to get the source, name, and size from every one of these rows?

Possible duplicate of [How do you parse and process HTML/XML in PHP?](http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php) — user3942918, Mar 28 '17 at 22:55
@PaulCrovella Maybe. But it doesn't exactly explain how to get many different and identical strings at once. I can agree that the title might be a bit of a duplicate of that thread, though. — Typewar, Mar 28 '17 at 23:22

sebi · Answer 1 · 2017-09-07T10:35:40.670

You might want to use preg_match_all, which returns every match rather than only the first. You can then loop over the results.

http://php.net/manual/en/function.preg-match-all.php

$html = '<tr>
    <td class="vertTh">
        <center>
            <a href="/s/browse/other.php">Other</a>
            <br>
            <a href="/s/browse/documents.php">Document</a>
        </center>
    </td>
    <td>
        <div class="Name">
            <a href="/s/database/Document_Title_Info" class="Link">Document Title Info</a>
        </div>
        <a href="http://another-example.com/source/to/document/which%20can%20be%20very%20long%20and%20have%20weird%20characters" title="Source">
            <img src="/static/img/icon-source.png" alt="Source">
        </a>
        <a href="http://example.com/source/to/document/which%20can%20be%20very%20long%20and%20have%20weird%20characters" title="Source">
            <img src="/static/img/icon-source.png" alt="Source">
        </a>
        <font class="Desc">Uploaded 03-24&nbsp;14:02, Size 267.35&nbsp;KB, ULed by <a class="Desc" href="/s/user/username/" title="Browse username">username</a></font>
    </td>
    <td align="right">67</td>
    <td align="right">9</td>
</tr>';

// Use | as the pattern delimiter so the slashes in http:// need no escaping
preg_match_all('|href="(http://.+?)" title="Source"|', $html, $matches);
// $matches[0] holds the full matches, $matches[1] the captured URLs
var_dump($matches);

// Loop over the captured URLs
foreach ($matches[1] as $match) {
    // $match is each source URL in turn, e.g.
    // http://example.com/source/to/document/which%20can%20be%20very%20long%20and%20have%20weird%20characters
}

You can try this example at http://phpfiddle.org/ or run it in a .php file locally. Good luck.

FYI: I added an extra anchor tag to illustrate finding another source.
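The question also asks for the name and size of each row, not only the source link. One way to collect all three per <tr> is to parse the page with DOMDocument and DOMXPath instead of a regex. This is a different technique from the preg_match_all approach above, shown only as a minimal sketch. The helper name extractRows() is hypothetical, the sample markup is a trimmed copy of the row from the question wrapped in a <table>, and the class and title selectors assume the real page matches that sample.

```php
<?php
// Sketch only: extractRows() is a hypothetical helper, not part of the answer above.
function extractRows(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows = [];

    foreach ($xpath->query('//tr') as $tr) {
        $name = $xpath->query('.//div[@class="Name"]/a', $tr)->item(0);
        $desc = $xpath->query('.//font[@class="Desc"]', $tr)->item(0);
        if ($name === null || $desc === null) {
            continue; // not a data row (assumption: data rows have both elements)
        }

        // A row may carry more than one source link.
        $sources = [];
        foreach ($xpath->query('.//a[@title="Source"]', $tr) as $a) {
            $sources[] = $a->getAttribute('href');
        }

        // The parser decodes &nbsp; to U+00A0, so allow it between number and unit.
        $size = null;
        if (preg_match('/Size\s+([\d.]+)[\s\x{00A0}]*([A-Za-z]+)/u', $desc->textContent, $m)) {
            $size = $m[1] . ' ' . $m[2];
        }

        $rows[] = [
            'name'    => trim($name->textContent),
            'sources' => $sources,
            'size'    => $size,
        ];
    }

    return $rows;
}

// Trimmed sample row based on the markup in the question.
$html = '<table>
<tr>
    <td>
        <div class="Name"><a href="/s/database/Document_Title_Info" class="Link">Document Title Info</a></div>
        <a href="http://example.com/source/to/document" title="Source"><img src="/static/img/icon-source.png" alt="Source"></a>
        <font class="Desc">Uploaded 03-24&nbsp;14:02, Size 267.35&nbsp;KB, ULed by <a class="Desc" href="/s/user/username/">username</a></font>
    </td>
</tr>
</table>';

$rows = extractRows($html);
foreach ($rows as $row) {
    echo $row['name'], ' | ', $row['size'], ' | ', implode(', ', $row['sources']), "\n";
}
```

Each entry in $rows holds one row's name, size, and list of source URLs, so a single foreach covers the whole page. If the site changes its class names or the title="Source" attribute, the XPath expressions would need to be adjusted.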

Seems interesting. I'll check it out tomorrow; I need some sleep now. The one thing that confuses me here is the weird syntax, like: preg_match_all("/(<([\w]+)[^>]*>)(.*?)(<\/\\2>)/", $html, $matches, PREG_SET_ORDER);. It's basically Greek to me at the moment. — Typewar, Mar 28 '17 at 23:17
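If the regex syntax is a hurdle, the getStringBetween() idea from the question can probably be extended to return every occurrence without any pattern at all, because strpos() accepts a start offset. The following is a rough sketch under that assumption. getAllStringsBetween() is a hypothetical name, and the two-link sample string is made up for illustration.

```php
<?php
// Hypothetical helper: like getStringBetween() from the question, but it
// returns every occurrence instead of only the first one.
function getAllStringsBetween(string $str, string $from, string $to): array
{
    // Empty markers would never advance the search, so bail out early.
    if ($from === '' || $to === '') {
        return [];
    }

    $results = [];
    $offset = 0;

    // Each search resumes right after the previous closing marker.
    while (($start = strpos($str, $from, $offset)) !== false) {
        $start += strlen($from);
        $end = strpos($str, $to, $start);
        if ($end === false) {
            break; // opening marker without a closing one
        }
        $results[] = substr($str, $start, $end - $start);
        $offset = $end + strlen($to);
    }

    return $results;
}

// Made-up sample containing only two source anchors.
$html = '<a href="http://another-example.com/source/to/a" title="Source"></a>'
      . '<a href="http://example.com/source/to/b" title="Source"></a>';

$sources = getAllStringsBetween($html, 'href="', '" title="Source"');
foreach ($sources as $source) {
    echo $source, "\n";
}
```

On a full page the opening marker has to be specific enough, as with 'example.com/source/to/' in the question. A bare 'href="' would also match unrelated links and pair them with a much later closing marker.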
