2

I have a Goutte/Client (goutte uses symfony for the requests) and I would like to join paths and get a final URL:

$client = new Goutte\Client();
$crawler = $client->request('GET', 'http://DOMAIN/some/path/');
// $crawler is an instance of Symfony\Component\DomCrawler\Crawler

$new_path = '../new_page';
$final_path = $crawler->someMagicFunction($new_path);
// $final_path == http://DOMAIN/some/new_page

What I'm looking for is an easy way to join the $new_path variable with the current page from the request and get the new URL.

Note that $new_path can be any of:

new_page    ==> http://DOMAIN/some/path/new_page
../new_page ==> http://DOMAIN/some/new_page
/new_page   ==> http://DOMAIN/new_page

Does symfony/goutte/guzzle give any easy way to do so?

I found getUriForPath in Symfony\Component\HttpFoundation\Request, but I don't see any easy way to convert the Symfony\Component\BrowserKit\Request to an HttpFoundation\Request.

Dekel
  • you really need to canonize the url's path? guzzle should be able to handle a request to `http://DOMAIN/some/path/../new_page` without problems – Federkun Nov 27 '16 at 15:09
  • Yeah, I need it for some other validations (and not for a specific request). Also - if the `$new_page` is `/new_page` I might have some problem with the final URL. – Dekel Nov 27 '16 at 15:11

2 Answers

5

Use Uri::resolve() from the guzzlehttp/psr7 package. This method lets you build a normalised URL from a base URI and a relative part.

An example (using the excellent psysh shell):

Psy Shell v0.7.2 (PHP 7.0.12 — cli) by Justin Hileman
>>> $base = new GuzzleHttp\Psr7\Uri('http://example.com/some/dir')
=> GuzzleHttp\Psr7\Uri {#208}
>>> (string) GuzzleHttp\Psr7\Uri::resolve($base, '/new_base/next/next/../../back_2')
=> "http://example.com/new_base/back_2"
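
For the three cases in the question, the same idea can be sketched as a small script. This is only a sketch under assumptions: it assumes guzzlehttp/psr7 1.4 or later installed through Composer, and it uses UriResolver::resolve(), which as far as I know replaced the static Uri::resolve() (deprecated in 1.4 and no longer present in 2.x). The host example.com and the $resolved variable are illustrative.

```php
<?php
use GuzzleHttp\Psr7\Uri;
use GuzzleHttp\Psr7\UriResolver;

// Assumption: the package was installed with `composer require guzzlehttp/psr7`.
require __DIR__ . '/vendor/autoload.php';

// Base URL of the page that was just requested (note the trailing slash).
$base = new Uri('http://example.com/some/path/');

// Resolve each relative reference against the base, following RFC 3986.
$resolved = [];
foreach (['new_page', '../new_page', '/new_page'] as $relative) {
    $resolved[$relative] = (string) UriResolver::resolve($base, new Uri($relative));
}

print_r($resolved);
```

With the Goutte client from the question, the base would presumably come from the current request's URI rather than a hard-coded string.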

Also take a look at the UriNormalizer class. One of its test cases relates to your issue.

From the test case:

$uri = new Uri('http://example.org/../a/b/../c/./d.html');
$normalizedUri = UriNormalizer::normalize($uri, UriNormalizer::REMOVE_DOT_SEGMENTS);

$this->assertSame('http://example.org/a/c/d.html', (string) $normalizedUri);
Alexey Shokov
  • I'm not sure how do you handle `http://example.org/page/` joined with `/new_page` (where the final url should be `http://example.org/new_page`). Can you explain? – Dekel Nov 28 '16 at 11:47
  • You are right. Just updated the answer with the correct solution using `guzzlehttp/psr7`. – Alexey Shokov Nov 28 '16 at 12:40
  • Thanks. It seems like the goutte version I'm using is a bit old and doesn't have the latest version of guzzle (which has the psr7 and the UriResolve). But you got my upvote here :) Thanks again for your help! – Dekel Nov 28 '16 at 12:44
  • You are welcome. Just install the psr7 package separately, it doesn't depend on the new Guzzle :) So you are able to use your current Goutte and the psr7 package. – Alexey Shokov Nov 28 '16 at 12:49
1

You can use parse_url to get the url's path:

$components = parse_url('http://DOMAIN/some/path/');
$path = $components['path'];

Then you need a way to canonicalize it. This answer can help you:

function normalizePath($path, $separator = '\\/')
{
    // Remove any kind of funky unicode whitespace
    $normalized = preg_replace('#\p{C}+|^\./#u', '', $path);

    // Remove self-referring path segments ("/./").
    $normalized = preg_replace('#/\.(?=/)|^\./|\./$#', '', $normalized);

    // Regex for resolving relative paths
    $regex = '#\/*[^/\.]+/\.\.#Uu';

    while (preg_match($regex, $normalized)) {
        $normalized = preg_replace($regex, '', $normalized);
    }

    if (preg_match('#/\.{2}|\.{2}/#', $normalized)) {
        throw new LogicException('Path is outside of the defined root, path: [' . $path . '], resolved: [' . $normalized . ']');
    }

    return trim($normalized, $separator);
}

All that's left is to rebuild the URL; see this comment:

function unparse_url($parsed_url) { 
    $scheme   = isset($parsed_url['scheme']) ? $parsed_url['scheme'] . '://' : ''; 
    $host     = isset($parsed_url['host']) ? $parsed_url['host'] : ''; 
    $port     = isset($parsed_url['port']) ? ':' . $parsed_url['port'] : ''; 
    $user     = isset($parsed_url['user']) ? $parsed_url['user'] : ''; 
    $pass     = isset($parsed_url['pass']) ? ':' . $parsed_url['pass']  : ''; 
    $pass     = ($user || $pass) ? "$pass@" : ''; 
    $path     = isset($parsed_url['path']) ? $parsed_url['path'] : ''; 
    $query    = isset($parsed_url['query']) ? '?' . $parsed_url['query'] : ''; 
    $fragment = isset($parsed_url['fragment']) ? '#' . $parsed_url['fragment'] : ''; 
    return "$scheme$user$pass$host$port/$path$query$fragment"; 
}

Final path:

$new_path = '../new_page';

if (strpos($new_path, '/') === 0) { // absolute path, replace it entirely
    $path = $new_path;
} else { // relative path, append it
    $path = $path . $new_path;
}

Put it all together:

// http://DOMAIN/some/new_page
echo unparse_url(array_replace($components, array('path' => normalizePath($path))));
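
For comparison, here is a dependency-free sketch that folds the steps above (parse, join, normalize, rebuild) into one function. The names joinUrl and removeDotSegments are hypothetical and not part of Symfony, Goutte or Guzzle. It assumes the second argument is a plain path with no scheme, host, query string or fragment, and it drops any user info, query and fragment from the base. The dot-segment handling is a simplified take on RFC 3986, section 5.2.4.

```php
<?php
// Hypothetical helper: collapse "." and ".." segments of an absolute path.
function removeDotSegments(string $path): string
{
    $segments = explode('/', $path);
    $last = end($segments);
    $output = [];
    foreach ($segments as $segment) {
        if ($segment === '..') {
            // Never pop the empty first element that stands for the leading "/".
            if (count($output) > 1) {
                array_pop($output);
            }
        } elseif ($segment !== '.') {
            $output[] = $segment;
        }
    }
    // A trailing "." or ".." names a directory, so keep the trailing slash.
    if ($last === '.' || $last === '..') {
        $output[] = '';
    }
    return implode('/', $output);
}

// Hypothetical helper: join a path-only reference onto an absolute base URL.
function joinUrl(string $base, string $relative): string
{
    $parts = parse_url($base);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        throw new InvalidArgumentException("Base must be an absolute URL: $base");
    }
    $basePath = $parts['path'] ?? '/';

    if (strpos($relative, '/') === 0) {
        // Absolute path: replaces the base path entirely.
        $path = $relative;
    } else {
        // Relative path: keep the base path up to its last "/", then append.
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $relative;
    }

    $port = isset($parts['port']) ? ':' . $parts['port'] : '';
    return $parts['scheme'] . '://' . $parts['host'] . $port . removeDotSegments($path);
}

echo joinUrl('http://DOMAIN/some/path/', 'new_page'), "\n";    // http://DOMAIN/some/path/new_page
echo joinUrl('http://DOMAIN/some/path/', '../new_page'), "\n"; // http://DOMAIN/some/new_page
echo joinUrl('http://DOMAIN/some/path/', '/new_page'), "\n";   // http://DOMAIN/new_page
```

Unlike normalizePath above, this sketch clamps ".." at the root instead of throwing, which matches how browsers resolve such links; throwing may be preferable if the result feeds a validation step.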
Federkun
  • Thanks for the answer, I was hoping Symfony will give an easier solution for this. Hope you don't mind - I'll wait a bit more before mark this as the correct answer, maybe someone will have a better solution. – Dekel Nov 27 '16 at 15:51
  • I'm not sure how do you handle `http://example.org/page/` joined with `/new_page` (where the final url should be `http://example.org/new_page`). Can you explain? – Dekel Nov 28 '16 at 11:47
  • The last example (`echo resolveUrl('http://example.org/page/', '/new_page'), "\n";`) gives `http://example.org/page` instead of `http://example.org/new_page`. – Dekel Nov 28 '16 at 12:15