I have a routine that takes a base domain URL (http://www.site.com), finds all the links on it, and then finds all the images and their attributes on each linked page. This is done with two nested loops:

  • an outer loop over the links, and inside each iteration
  • an inner loop over each image found on that page.

I've been using my band's website as a test bed. Every page has a "spotlight" section of featured articles at the top, which is set up as an image slider, so the same slider images appear on every page. I only want unique image URLs for the whole website, but everything I try still lets duplicates through. I tried doing the dupe check while building the array, but that was fruitless. Then I found this link: How to remove duplicate values from a multi-dimensional array in PHP and comment, but that does not work either.

Let's start with a sample array of data I scraped from my band's website:

Array
(
[http://darwenstheory.com/] => Array
    (
        [0] => Array
            (
                [3] => Array
                    (
                        [url] => http://darwenstheory.com/images/dtheory-spotlight-vidclips.jpg
                        [alt] => Ventura Theater Video Clips Posted!
                        [w] => 644
                        [h] => 202
                        [ratio] => 3.2
                    )

            )

        [1] => Array
            (
                [3] => Array
                    (
                        [url] => http://darwenstheory.com/images/dtheory-spotlight-vtpix.jpg
                        [alt] => Video Clips Posted!
                        [w] => 644
                        [h] => 202
                        [ratio] => 3.2
                    )

            )

        [2] => Array
            (
                [3] => Array
                    (
                        [url] => http://darwenstheory.com/images/dtheory-spotlight-merch.jpg
                        [alt] => Photos from Ventura Theater!
                        [w] => 644
                        [h] => 202
                        [ratio] => 3.2
                    )

            )

        [3] => Array
            (
                [4] => Array
                    (
                        [url] => http://darwenstheory.com/wp-content/uploads/2011/10/peepdestroyflyer.jpg
                        [alt] => 
                        [w] => 533
                        [h] => 800
                        [ratio] => 0.7
                    )

            )

    )

[http://darwenstheory.com/2011/01/11/ventura-theater-video-clips-posted/] => Array
    (
        [0] => Array
            (
                [3] => Array
                    (
                        [url] => http://darwenstheory.com/images/dtheory-spotlight-vidclips.jpg
                        [alt] => Ventura Theater Video Clips Posted!
                        [w] => 644
                        [h] => 202
                        [ratio] => 3.2
                    )

            )

        [1] => Array
            (
                [3] => Array
                    (
                        [url] => http://darwenstheory.com/images/dtheory-spotlight-vtpix.jpg
                        [alt] => Video Clips Posted!
                        [w] => 644
                        [h] => 202
                        [ratio] => 3.2
                    )

            )

        [2] => Array
            (
                [3] => Array
                    (
                        [url] => http://darwenstheory.com/images/dtheory-spotlight-merch.jpg
                        [alt] => Photos from Ventura Theater!
                        [w] => 644
                        [h] => 202
                        [ratio] => 3.2
                    )

            )

    )

)

In the array above, the second entry (keyed by the URL of a sub-page on the domain) should not contain its first three image URLs, because they already appear under the first entry. Here is a simplified version of what I am using to build the array:

foreach($links as $link)
{
    $images = get_page_images($link); //array;
    foreach($images as $image)
    {
        //i have some things here to setup a "score" for each image
        $data['scrape'][$link][][$score] = array('url' => $image['url'], 'alt' => $image['alt'], 'w' => $image['w'], 'h' => $image['h'], 'ratio' => $ratio);
    }
}

I have a feeling I am over-complicating this, but I have no idea how or why. I'm here to learn, whether it's me being stupid or something else.

I would just like the above array I am building to not have a duplicate value for the 'url' key in the deepest-level array.

Thank you so, so much in advance for criticism, help, and everything.

Kinsbane

2 Answers

I'd still do the dupe check while building the array:

$urls = array();

foreach($links as $link)
{
    $images = get_page_images($link); //array;
    foreach($images as $image)
    {
        if (!isset($urls[$image['url']])) // <- dupe check added (isset avoids an undefined-index notice)
        {
            $urls[$image['url']] = true;  // <- dupe check added

            //i have some things here to setup a "score" for each image
            $data['scrape'][$link][][$score] = array('url' => $image['url'], 'alt' => $image['alt'], 'w' => $image['w'], 'h' => $image['h'], 'ratio' => $ratio);
        }
    }
}
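
For anyone who wants to run the idea end to end, here is a minimal sketch of the same check. It assumes a hypothetical get_page_images_stub() that returns canned data in place of the real get_page_images(), uses example.com URLs rather than the real site, and leaves out the $score level to keep it short.

```php
<?php
// Hypothetical stand-in for get_page_images(): canned data, no HTTP requests.
function get_page_images_stub(string $link): array
{
    $pages = [
        'http://example.com/' => [
            ['url' => 'http://example.com/a.jpg', 'alt' => 'A', 'w' => 644, 'h' => 202],
            ['url' => 'http://example.com/b.jpg', 'alt' => 'B', 'w' => 533, 'h' => 800],
        ],
        'http://example.com/post/' => [
            // Same slider image as on the home page.
            ['url' => 'http://example.com/a.jpg', 'alt' => 'A', 'w' => 644, 'h' => 202],
            ['url' => 'http://example.com/c.jpg', 'alt' => 'C', 'w' => 300, 'h' => 300],
        ],
    ];
    return $pages[$link] ?? [];
}

function collect_unique_images(array $links): array
{
    $seen = [];   // image URL => true, shared across all pages
    $scrape = [];
    foreach ($links as $link) {
        foreach (get_page_images_stub($link) as $image) {
            if (isset($seen[$image['url']])) {
                continue; // already recorded for an earlier page
            }
            $seen[$image['url']] = true;
            $image['ratio'] = round($image['w'] / $image['h'], 1);
            $scrape[$link][] = $image;
        }
    }
    return $scrape;
}

$result = collect_unique_images(['http://example.com/', 'http://example.com/post/']);
```

Because the lookup array lives outside both loops, an image is kept only for the first page it is seen on, which is why link order matters.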
AndreKR

That's a lot to look at, but off the bat I'd suggest keeping a separate base array of the image URLs you have already seen, checking each image against it, and only adding the image to your result if its URL is not already in the base array:

$image_arr = array();
foreach($links as $link)
{

  $images = get_page_images($link); //array;
  foreach($images as $image)
  {
      if(!in_array($image['url'], $image_arr))
      {
            //i have some things here to setup a "score" for each image
            $data['scrape'][$link][][$score] = array('url' => $image['url'], 'alt' => $image['alt'], 'w' => $image['w'], 'h' => $image['h'], 'ratio' => $ratio);
            $image_arr[$image['url']] = $image['url']; // remember this URL for later iterations
      }
  }
}
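
If the array has already been built, the same idea can be applied as a second pass. This is only a sketch: dedupe_scrape_by_url() is a hypothetical helper, it assumes the $data['scrape'] shape from the question (page URL, then a list of one-element arrays keyed by score), and the sample data uses example.com.

```php
<?php
// Hypothetical helper: keeps the first occurrence of each image URL.
function dedupe_scrape_by_url(array $scrape): array
{
    $seen = [];   // image URL => true
    $unique = [];
    foreach ($scrape as $link => $entries) {
        foreach ($entries as $entry) {
            // Each entry is a one-element array keyed by the image's score.
            foreach ($entry as $score => $image) {
                if (isset($seen[$image['url']])) {
                    continue; // duplicate of an image kept earlier
                }
                $seen[$image['url']] = true;
                $unique[$link][][$score] = $image;
            }
        }
    }
    return $unique;
}

$scrape = [
    'http://example.com/' => [
        [3 => ['url' => 'http://example.com/a.jpg', 'alt' => 'A']],
        [4 => ['url' => 'http://example.com/b.jpg', 'alt' => 'B']],
    ],
    'http://example.com/post/' => [
        [3 => ['url' => 'http://example.com/a.jpg', 'alt' => 'A']],
    ],
];

$clean = dedupe_scrape_by_url($scrape);
```

Note that a page whose images are all duplicates ends up with no entry at all in the result, so check with isset() before reading a page's list.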
Kai Qing
  • Thanks. I had thought being able to just "check-on-the-fly", so to speak, would work by just checking the actual array I was building, rather than a separate one. But, I guess if that's what it takes! – Kinsbane Nov 16 '11 at 18:31