
I'm working on converting a website. This involves standardizing the directory structure of images and media files. I'm parsing path information from various tags, standardizing the paths, checking whether the media exists in the new standardized location, and putting it there if it doesn't. I'm using string manipulation to do so.

This is a little open-ended, but is there a class, tool, or concept out there I can use to save myself some headaches? For instance, I'm running into problems where, say, a page in a subdirectory (website.com/subdir/dir/page.php) has relative image paths (../images/image.png), and other things like that. It's not like there's one overarching problem, just a lot of little things that add up.

When I think I've got my script covering most cases, I get errors like "Could not find file at export/standardized_folder/proper_image_folderimage.png" where it should be export/standardized_folder/proper_image_folder/image.png. It's kind of driving me mad, doing string parsing and checks to make sure that directory separators are in the proper places.

I feel like I'm putting too much work into making a one-off import script very robust. Perhaps someone's already untangled this mess in a re-useable way, one which I can take advantage of?

Post Script: So here's a more in-depth scoop. I write my script to parse one "type" of page and pull content from other pages of the same kind. Then I turn my script to parse another type of page, get all kinds of errors, and learn that all my assumptions about how paths are referenced must be thrown out the window. Wash, rinse, repeat.

So I'm looking at doing some major re-factoring of my script, throwing out all assumptions, and checking, re-checking, and double-checking path information. Since I'm really trying to build a robust path building script, hopefully I can avoid re-inventing the wheel. Is there a wheel out there?

user151841
  • So you are reading the image URLs out of the HTML source code and you have problems mapping the relative paths to the document base URL? – hakre Sep 16 '11 at 15:07
  • Some are relative and some are absolute. It's just a big mess. – user151841 Sep 16 '11 at 15:09
  • How do you acquire the HTML source? Is it via DomDocument? – hakre Sep 16 '11 at 15:17
  • I'm using simple_html_dom (http://simplehtmldom.sourceforge.net/) to parse the pages downloaded from a site scrape with wget. – user151841 Sep 16 '11 at 15:23
  • I wonder if you're wasting _more_ time on a one-off script than it would take to do it manually. I only bring it up because I've actually done that. – Herbert Sep 16 '11 at 15:29
  • It sure feels like it sometimes, but there's some 1700 pages that I'm pulling stuff from. So I'm actually not :P – user151841 Sep 16 '11 at 15:39

2 Answers


If your problems have their root in resolving the relative links in a document to absolute ones (which should be half the job of mapping the linked image paths onto the file system), I normally use Net_URL2 from PEAR. It's a simple class that just does the job.

To install, as root just call

# pear install channel://pear.php.net/Net_URL2-0.3.1

Even if it's a beta package, it's really stable.

A little example: let's say there is an array with all the image srcs in question and a base URL for the document:

require_once('Net/URL2.php');

$baseUrl = 'http://www.example.com/test/images.html';

$docSrcs = array(...);

$baseUrl = new Net_URL2($baseUrl);

foreach($docSrcs as $href)
{
    $url = $baseUrl->resolve($href);
    echo ' * ', $href, ' -> ', $url->getURL(), "\n";
    // or
    echo " $href -> $url\n"; # Net_URL2 supports string context
}

This will convert any relative links into absolute ones based on your base URL. The base URL is, first of all, the document's address. The document can override it by specifying another one with the base element. So you could look that up with the HTML parser you're already using (as well as the src and href values).
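The base element lookup described above could be sketched as follows. This is only an illustration: it uses PHP's built-in DOMDocument rather than the parser from the question, and the function name effectiveBaseUrl() is a hypothetical helper, not part of Net_URL2.

```php
<?php
// Hypothetical helper (not part of Net_URL2): returns the URL that relative
// links in $html should be resolved against.
function effectiveBaseUrl(string $html, string $documentUrl): string
{
    $doc = new DOMDocument();

    // Scraped HTML is rarely valid, so collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('base') as $base) {
        $href = trim($base->getAttribute('href'));
        if ($href !== '') {
            // The first <base> element carrying an href wins.
            // Assumption: the href is absolute. A relative one would itself
            // have to be resolved against $documentUrl first.
            return $href;
        }
    }

    // No <base href="..."> found: fall back to the document's own address.
    return $documentUrl;
}
```

The returned string can then be passed to new Net_URL2(...) in place of the hard-coded $baseUrl from the example.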

Net_URL2 follows the current RFC 3986 when resolving URLs.

Another thing that might be handy for your URL handling is the getNormalizedURL function. It removes some potential error cases such as needless dot segments, which is useful if you need to compare one URL with another and, naturally, for mapping the URL to a path afterwards:

foreach($docSrcs as $href)
{
    $url = $baseUrl->resolve($href);
    $url = $url->getNormalizedURL();
    echo " $href -> $url\n";
}
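To make "needless dot segments" concrete, here is a minimal sketch of that kind of clean-up for an absolute URL path, in the spirit of RFC 3986. It is not Net_URL2's actual implementation; removeDotSegments() is a hypothetical name, and the function assumes the path starts with a slash.

```php
<?php
// Hypothetical illustration of dot-segment removal for an absolute path.
// Assumption: $path starts with "/" (as the path of a resolved URL does).
function removeDotSegments(string $path): string
{
    $segments = explode('/', $path);
    $last = count($segments) - 1;
    $output = array();

    foreach ($segments as $i => $segment) {
        if ($segment === '.' || $segment === '..') {
            // ".." drops the previous segment, but never the leading empty
            // segment that stands for the root.
            if ($segment === '..' && count($output) > 1) {
                array_pop($output);
            }
            // A trailing "." or ".." leaves a trailing slash behind.
            if ($i === $last) {
                $output[] = '';
            }
            continue;
        }
        $output[] = $segment;
    }

    return implode('/', $output);
}

echo removeDotSegments('/subdir/dir/../images/image.png'), "\n";
```

With the relative path from the question, /subdir/dir/../images/image.png collapses to /subdir/images/image.png, which is the form you want before comparing URLs or mapping them to files.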

Since you can now resolve all URLs to absolute ones and normalize them, you can decide whether or not they belong to your site. As long as the URL is still a Net_URL2 instance, you can use one of its many functions to do that:

$host = strtolower($url->getHost());
if (in_array($host, array('example.com', 'www.example.com')))
{
    # URL is on my server, process it further
}

What's left is the concrete path to the file in the URL:

$path = $url->getPath();

That path, considering you're comparing against a UNIX file-system, should be easy to prefix with a concrete base directory:

$filesystemImagePath = '/var/www/site-new/images';
$newPath = $filesystemImagePath . $path;
if (is_file($newPath))
{
    # new image already exists.
}

If you have trouble combining the base path with the image path, keep in mind that the image path will always have a slash at the beginning, so the base directory should not end with one.
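If the separators keep causing trouble, one option is to do the joining in a single place. The following is a minimal sketch under stated assumptions: joinPath() and ensureMedia() are hypothetical helpers (not part of Net_URL2), and paths are assumed to use forward slashes.

```php
<?php
// Hypothetical helper: joins a base directory and a path with exactly one
// slash between them, whether or not either side already has one.
function joinPath(string $base, string $path): string
{
    return rtrim($base, '/') . '/' . ltrim($path, '/');
}

// Hypothetical helper: makes sure $target exists, copying it from $source
// (and creating missing directories) if it does not.
function ensureMedia(string $source, string $target): bool
{
    if (is_file($target)) {
        return true; // already in the standardized location
    }

    $dir = dirname($target);
    if (!is_dir($dir) && !mkdir($dir, 0777, true) && !is_dir($dir)) {
        return false; // the target directory could not be created
    }

    return copy($source, $target);
}

echo joinPath('export/standardized_folder/proper_image_folder', 'image.png'), "\n";
```

Routing every concatenation through one function like this avoids results such as proper_image_folderimage.png, because the number of slashes no longer depends on how each input happened to be written.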

Hope this helps.

hakre

Truepath() to the rescue! No, you shouldn't use realpath() (see why).

Christian
  • You should "glance" better @NikiC... Apparently I wasn't clear enough in how it works. It doesn't work over network resources by design. It only works on local file system, no UNC, no HTTP and no FTP. :) – Christian Sep 16 '11 at 15:17
  • @hakre - WTF?! It does that specifically. Are you blind or something? – Christian Sep 16 '11 at 15:23
  • How to provide the baseurl for the relative links with truepath? – hakre Sep 16 '11 at 15:25
  • @Christian: You probably should make more clear that truepath() has nothing and absolutely nothing to do with realpath() ;) You seem to call it a realpath replacement, though it is in no way. It's only a local file path normalization function. (PS: I removed my downvote here, as in this situation the function might be the right thing.) – NikiC Sep 16 '11 at 15:30
  • @hakre - You provide it once at the beginning of the string. If it's not needed, it's not used. But I see your point, you have to add the path yourself. Though it isn't such a big deal... – Christian Sep 16 '11 at 15:30
  • @NikiC a "replacement" implies acquiring `realpath`'s nice features including crashing, failing on files it worked correctly on a few seconds earlier etc, so yeah, it isn't a `realpath` replacement, thank god. – Christian Sep 16 '11 at 15:31
  • @Christian: That's not exactly what I meant (PS: Did you try clearing the real path cache when you experienced the issues?). `realpath`'s job is to give you the real path of a file (yeah, that's why they called it like that). You give it a path and it will return you where exactly it is located, with all symlinks resolved and so on. Truepath() on the other hand is for path normalization. It doesn't replace it, it provides additional / different (useful) functionality. – NikiC Sep 16 '11 at 15:36
  • @Christian: Resolving relative URLs to absolute URLs is not the same as resolving paths on common file-systems. – hakre Sep 16 '11 at 16:04
  • @NikiC it does resolve symlinks (see line 41, readlink()). It just does normalization first. It does replace the path eventually, if it is in fact a symlink. **NB:** Yes I did clear the cache, and it fixed an issue occasionally, does that imply software that uses caching should be faulting every now and then? Did you try my suggestion at searching for PHP bugs directly related to `realpath`? You should :). – Christian Sep 16 '11 at 21:38
  • @hakre - I don't understand what you're saying. I already said this is not for URLs but for files. If the OP wants to use this for URLs, he just has to `str_replace(array('http://', 'https://', DOMAIN), array('','',DOC_ROOT))`, and perhaps additionally replace directory separator if on Windows. – Christian Sep 16 '11 at 21:38