
I'm building a kind of crawler/proxy at the moment. It lets users navigate another website while staying on my site. While it loads each page, I'd also like to extract all the links and data from it at the same time.

The website contains many <tr> elements, each of which contains a lot of other markup.

Here is one example of the many on the website:

<tr>
    <td class="vertTh">
        <center>
            <a href="/s/browse/other.php">Other</a>
            <br>
            <a href="/s/browse/documents.php">Document</a>
        </center>
    </td>
    <td>
        <div class="Name">
            <a href="/s/database/Document_Title_Info" class="Link">Document Title Info</a>
        </div>
        <a href="http://example.com/source/to/document/which%20can%20be%20very%20long%20and%20have%20weird%20characters" title="Source">
            <img src="/static/img/icon-source.png" alt="Source">
        </a>
        <font class="Desc">Uploaded 03-24&nbsp;14:02, Size 267.35&nbsp;KB, ULed by <a class="Desc" href="/s/user/username/" title="Browse username">username</a></font>
    </td>
    <td align="right">67</td>
    <td align="right">9</td>
</tr>

Users browse the proxy site, and while they do, it captures info from the original website. I figured out how to get the string between two words, but I don't know how to turn this into a "foreach" loop or something similar.

So let's say I want to get the source link. Then I would do something like this:

$url = $_GET['url'];
$str = file_get_contents('https://database.com/' . $url);

// Output looks like this: http://example.com/source/to/document/which%20can%20be%20very%20long%20and%20have%20weird%20characters
$source = 'http://example.com/source/to/' . getStringBetween($str, 'example.com/source/to/', '" title="Source">');

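// Returns the text between the first $from and the next $to ('' if either is missing).
function getStringBetween($str, $from, $to)
{
    $start = strpos($str, $from);
    if ($start === false) return '';
    $sub = substr($str, $start + strlen($from));
    $end = strpos($sub, $to);
    return $end === false ? '' : substr($sub, 0, $end);
}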

But I can't just do this, because the page contains many of these strings. Is there a way to get the source, name and size from every one of these rows?

Typewar
  • Possible duplicate of [How do you parse and process HTML/XML in PHP?](http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php) – user3942918 Mar 28 '17 at 22:55
  • @PaulCrovella Maybe. But it doesn't exactly explain how to get many different and identical strings at once. I agree that the title might be a bit of a duplicate of that thread, though. – Typewar Mar 28 '17 at 23:22

1 Answer


You might want to use preg_match_all, which gives you a list of all the matches. Then you can loop over them.

http://php.net/manual/en/function.preg-match-all.php

$html = '<tr>
    <td class="vertTh">
        <center>
            <a href="/s/browse/other.php">Other</a>
            <br>
            <a href="/s/browse/documents.php">Document</a>
        </center>
    </td>
    <td>
        <div class="Name">
            <a href="/s/database/Document_Title_Info" class="Link">Document Title Info</a>
        </div>
        <a href="http://another-example.com/source/to/document/which%20can%20be%20very%20long%20and%20have%20weird%20characters" title="Source">
            <img src="/static/img/icon-source.png" alt="Source">
        </a>
        <a href="http://example.com/source/to/document/which%20can%20be%20very%20long%20and%20have%20weird%20characters" title="Source">
            <img src="/static/img/icon-source.png" alt="Source">
        </a>
        <font class="Desc">Uploaded 03-24&nbsp;14:02, Size 267.35&nbsp;KB, ULed by <a class="Desc" href="/s/user/username/" title="Browse username">username</a></font>
    </td>
    <td align="right">67</td>
    <td align="right">9</td>
</tr>';

// Use | as the pattern delimiter so the slashes in the URL don't need escaping
preg_match_all('|href="(http://.+?)" title="Source"|', $html, $matches);
// Inspect the result: $matches[0] holds the full matches, $matches[1] the captured URLs
var_dump($matches);

// Loop over the captured URLs
foreach ($matches[1] as $match) {
    // $match is each source URL in turn, e.g. http://example.com/source/to/document/which%20can%20be%20very%20long%20and%20have%20weird%20characters
}
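
To get the name and size along with the source for every row, one option is to match each <tr> first and then run smaller patterns inside it. Below is a minimal sketch of that idea, not a definitive implementation. The two-row sample markup, the function name extractRows and the array keys are made up for illustration, and it assumes every row follows the layout shown in the question.

```php
<?php
// Hypothetical sample: two rows shaped like the markup in the question.
$html = <<<'HTML'
<tr>
    <td>
        <div class="Name"><a href="/s/database/First_Doc" class="Link">First Doc</a></div>
        <a href="http://example.com/source/to/first%20doc" title="Source"><img src="/static/img/icon-source.png" alt="Source"></a>
        <font class="Desc">Uploaded 03-24&nbsp;14:02, Size 267.35&nbsp;KB, ULed by <a class="Desc" href="/s/user/username/">username</a></font>
    </td>
</tr>
<tr>
    <td>
        <div class="Name"><a href="/s/database/Second_Doc" class="Link">Second Doc</a></div>
        <a href="http://another-example.com/source/to/second" title="Source"><img src="/static/img/icon-source.png" alt="Source"></a>
        <font class="Desc">Uploaded 03-25&nbsp;09:10, Size 1.2&nbsp;MB, ULed by <a class="Desc" href="/s/user/other/">other</a></font>
    </td>
</tr>
HTML;

// Returns one ['name' => ..., 'source' => ..., 'size' => ...] array per <tr>.
// A field stays null when its pattern does not match inside that row.
function extractRows(string $html): array
{
    $rows = [];
    // The "s" modifier lets "." match newlines, so a row can span several lines.
    preg_match_all('|<tr\b[^>]*>(.*?)</tr>|s', $html, $trs);
    foreach ($trs[1] as $tr) {
        $row = ['name' => null, 'source' => null, 'size' => null];
        if (preg_match('|class="Link">(.*?)</a>|s', $tr, $m)) {
            $row['name'] = trim($m[1]);
        }
        if (preg_match('|href="(https?://[^"]+)" title="Source"|', $tr, $m)) {
            $row['source'] = $m[1];
        }
        if (preg_match('|Size (.*?),|', $tr, $m)) {
            // The size is written as "267.35&nbsp;KB" in the markup.
            $row['size'] = str_replace('&nbsp;', ' ', $m[1]);
        }
        $rows[] = $row;
    }
    return $rows;
}

$rows = extractRows($html);
foreach ($rows as $row) {
    echo $row['name'], ' | ', $row['source'], ' | ', $row['size'], PHP_EOL;
}
```

Matching row by row keeps the name, source and size of one document together, which a single preg_match_all over the whole page would not guarantee.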

You can try the preg_match_all example above at http://phpfiddle.org/ or run it locally in a .php file. Good luck.

FYI: I added an extra anchor tag to illustrate finding another source.
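
Regular expressions get brittle once the markup changes, so as an alternative (and along the lines of the "possible duplicate" linked under the question) the same rows could be read with PHP's DOMDocument and DOMXPath. This is only a sketch under assumptions: it needs the dom extension, the sample table is made up, extractRowsWithDom is a hypothetical name, and the XPath expressions assume the class and title attributes shown in the question.

```php
<?php
// Hypothetical sample: a table with two rows shaped like the markup in the question.
$html = <<<'HTML'
<table>
<tr>
    <td>
        <div class="Name"><a href="/s/database/First_Doc" class="Link">First Doc</a></div>
        <a href="http://example.com/source/to/first%20doc" title="Source"><img src="/static/img/icon-source.png" alt="Source"></a>
        <font class="Desc">Uploaded 03-24&nbsp;14:02, Size 267.35&nbsp;KB, ULed by <a class="Desc" href="/s/user/username/">username</a></font>
    </td>
</tr>
<tr>
    <td>
        <div class="Name"><a href="/s/database/Second_Doc" class="Link">Second Doc</a></div>
        <a href="http://another-example.com/source/to/second" title="Source"><img src="/static/img/icon-source.png" alt="Source"></a>
        <font class="Desc">Uploaded 03-25&nbsp;09:10, Size 1.2&nbsp;MB, ULed by <a class="Desc" href="/s/user/other/">other</a></font>
    </td>
</tr>
</table>
HTML;

// Returns one ['name' => ..., 'source' => ..., 'size' => ...] array per <tr>.
function extractRowsWithDom(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows = [];
    foreach ($xpath->query('//tr') as $tr) {
        // The second argument makes each query relative to the current row.
        $name   = $xpath->query('.//div[@class="Name"]/a', $tr)->item(0);
        $source = $xpath->query('.//a[@title="Source"]', $tr)->item(0);
        $desc   = $xpath->query('.//font[@class="Desc"]', $tr)->item(0);

        $size = null;
        if ($desc !== null) {
            // &nbsp; is decoded to U+00A0 by the parser; turn it into a plain space.
            $text = str_replace("\u{00A0}", ' ', $desc->textContent);
            if (preg_match('/Size\s+([\d.]+)\s*([A-Za-z]+)/', $text, $m)) {
                $size = $m[1] . ' ' . $m[2];
            }
        }

        $rows[] = [
            'name'   => $name !== null ? trim($name->textContent) : null,
            'source' => $source !== null ? $source->getAttribute('href') : null,
            'size'   => $size,
        ];
    }
    return $rows;
}

$rows = extractRowsWithDom($html);
foreach ($rows as $row) {
    echo $row['name'], ' | ', $row['source'], ' | ', $row['size'], PHP_EOL;
}
```

For a live page, $html would come from the file_get_contents call in the question instead of the sample string.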

sebi
  • Seems interesting. I'll check it out tomorrow. Need some sleep now. But the one thing that confuses me here is the weird patterns, like: preg_match_all("/(<([\w]+)[^>]*>)(.*?)(<\/\\2>)/", $html, $matches, PREG_SET_ORDER);. It's basically Greek to me at the moment. – Typewar Mar 28 '17 at 23:17