
Question: How do I use Scrapy to create a non-duplicative list of absolute paths from the relative paths in `img src` attributes?

Background: I am trying to use Scrapy to crawl a site, pull any URLs from `img src` attributes, convert the relative paths to absolute paths, and then output the absolute paths as CSV rows or a Python list (a short sketch of the resolution step follows the list below). I plan on combining the above function with actually downloading files using Scrapy and concurrently crawling for links, but I'll cross that bridge when I get to it. For reference, here are some other details about the hypothetical target site:

  • The relative paths look like img src="/images/file1.jpg", where images is a directory (www.example.com/products/images) that cannot be directly crawled for file paths.
  • The relative paths for these images do not follow any logical naming convention (e.g., file1.jpg, file2.jpg, file3.jpg).
  • The image types differ across files, with PNG and JPG being the most common.
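
For context on the relative-to-absolute step referenced above: Scrapy's `response.urljoin()` resolves a (possibly relative) src against the URL of the page it was found on, just like `urllib.parse.urljoin` does. A minimal sketch with hypothetical page and image URLs:

from urllib.parse import urljoin

page_url = "https://www.example.com/products/page1.html"  # hypothetical page URL

# A root-relative src (leading slash) resolves against the site root
print(urljoin(page_url, "/images/file1.jpg"))
# -> https://www.example.com/images/file1.jpg

# A path-relative src resolves against the page's directory
print(urljoin(page_url, "images/file1.jpg"))
# -> https://www.example.com/products/images/file1.jpg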

Problems experienced: Even after thoroughly reading the Scrapy documentation and going through a ton of fairly dated Stack Overflow questions [e.g., this question], I can't seem to get the precise output I want. I can pull the relative paths and reconstruct them, but the output is off. Here are the issues I've noticed with my current code:

  • In the CSV output, there are both populated rows and blank rows. My best guess is that each row represents the results of scraping a particular page for relative paths, which would mean a blank row is a negative result.

  • Each non-blank row in the CSV contains a list of URLs delimited by commas, whereas I would simply like one non-duplicative value per row (illustrated in the sketch after this list). The comma-delimited rows seem to support my suspicions about what is going on under the hood.
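
To make the comma-delimited-row suspicion concrete, here is a small illustrative sketch (hypothetical URLs, using Scrapy's CsvItemExporter directly) of why a list-valued field ends up as one comma-joined cell, while one item per URL becomes one row per URL:

from io import BytesIO
from scrapy.exporters import CsvItemExporter

buffer = BytesIO()
exporter = CsvItemExporter(buffer)  # joins list values with a comma by default
exporter.start_exporting()

# One item per page, holding a list of URLs -> a single row with a comma-joined cell
exporter.export_item({'url': ['https://www.example.com/images/a.jpg',
                              'https://www.example.com/images/b.png']})

# One item per URL -> one row per URL
exporter.export_item({'url': 'https://www.example.com/images/c.jpg'})
exporter.export_item({'url': 'https://www.example.com/images/d.png'})

exporter.finish_exporting()
print(buffer.getvalue().decode())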

Current code: I execute the following code from the command line with `scrapy crawl relpathfinder -o output.csv -t csv`.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor  # scrapy.contrib.linkextractors is deprecated
from scrapy.item import Item, Field


class MyItem(Item):
    url = Field()


class MySpider(CrawlSpider):
    name = 'relpathfinder'
    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com/']
    rules = (Rule(LinkExtractor(allow=()), callback='url_join', follow=True),)

    def url_join(self, response):
        item = MyItem()
        item['url'] = []
        relative_url = response.xpath('//img/@src').extract()
        for link in relative_url:
            item['url'].append(response.urljoin(link))
        yield item  # one item per page, holding a list of URLs

Thank you!

Tigelle
  • Both of your 'experienced' problems are quite explicit in your code: 1) you always `yield` an item; 2) your item contains a list of urls rather than a single url. – de1 Jan 01 '18 at 17:27
  • @de1 Thanks for taking a look. Is the solution, then, to simply yield item['url']? I'm still having some trouble fully grasping what each component of the spider does and what is happening under the hood in my projects. I've gotten part of the way there, though the output piece is unclear to me. – Tigelle Jan 01 '18 at 18:02
  • @Tigelle so you want a row per new `src` url? – eLRuLL Jan 01 '18 at 18:07
  • @eLRuLL That's correct. I do want to dedupe, but the important factor is not getting a list of URLs back in each CSV row. – Tigelle Jan 01 '18 at 18:36
  • @Tigelle, You can use an Item Pipeline to deal with the duplicate items. You can use DuplicatesPipeline https://doc.scrapy.org/en/latest/topics/item-pipeline.html#duplicates-filter to solve your problem. – matiskay Jan 01 '18 at 23:38

2 Answers

1

I would use an Item Pipeline to deal with the duplicated items.

# file: yourproject/pipelines.py
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.url_seen = set()  # absolute URLs emitted so far in this crawl

    def process_item(self, item, spider):
        if item['url'] in self.url_seen:
            raise DropItem("Duplicate item found: %s" % item)  # drop repeated URLs
        else:
            self.url_seen.add(item['url'])
            return item

And add this pipeline to your settings.py

# file: yourproject/settings.py
ITEM_PIPELINES = {
    'your_project.pipelines.DuplicatesPipeline': 300,
}

Then you just need to run your spider with `scrapy crawl relpathfinder -o items.csv` and the pipeline will drop duplicate items for you, so you will not see any duplicates in your CSV output.
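
A quick standalone illustration (an editorial sketch, not part of the original answer; it assumes `url` holds a single string, i.e. one item per URL as in the other answer below) of how the pipeline behaves:

from scrapy.exceptions import DropItem

# assumes the DuplicatesPipeline class defined above is in scope
pipeline = DuplicatesPipeline()

# The first occurrence of a URL passes through unchanged
print(pipeline.process_item({'url': 'https://www.example.com/images/a.jpg'}, spider=None))

# A second occurrence of the same URL raises DropItem and never reaches the CSV
try:
    pipeline.process_item({'url': 'https://www.example.com/images/a.jpg'}, spider=None)
except DropItem as exc:
    print(exc)  # Duplicate item found: {'url': ...}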

matiskay
  • Excellent. I really appreciate this. I read about various pipelines, but I didn't recall this one. Very useful. – Tigelle Jan 02 '18 at 00:02
  • Just added a new question at https://stackoverflow.com/questions/48611682/how-to-use-scrapy-files-pipeline-for-absolute-and-relative-paths-using-xpath-sel. Would love your thoughts. – Tigelle Feb 04 '18 at 19:49
0

What about:

def url_join(self, response):
    relative_url = response.xpath('//img/@src').extract()
    for link in relative_url:
        item = MyItem()                        # fresh item for each URL
        item['url'] = response.urljoin(link)   # a single absolute URL per item
        yield item                             # -> one CSV row per image URL
eLRuLL
  • Many thanks! That did it on returning the constructed absolute paths. It seems like I had some extraneous code and a forehead-slapping indentation issue (doh!). So am I right that in my earlier code, for each page my spider scraped I was creating a single list of file paths, which is why I had blank rows in the CSV and lists of URLs in others? – Tigelle Jan 01 '18 at 19:44
  • @Tigelle yeah exactly, you were returning a single item for several urls. Just remember that the outputted `csv` is just a line-by-line representation of every item you return. – eLRuLL Jan 01 '18 at 19:46
  • I really appreciate it. I'm guessing for deduping I just need to add a list comprehension or something of the like that tests whether each item is in the running list. Here, again, I bump up against my somewhat limited Python skills and definitely limited Scrapy skills. What would be the thing against which I compare an iteration of item['url']? (A sketch of this in-spider approach follows the comment thread.) – Tigelle Jan 01 '18 at 20:33
  • Thank you again for offering this very useful response to my earlier question. As an FYI, I added a similar question that builds off of this question somewhat at: https://stackoverflow.com/questions/48611682/scrapy-enabling-files-pipeline-for-absolute-and-relative-paths. If you can offer any thoughts, I would greatly appreciate it. – Tigelle Feb 07 '18 at 02:29
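
For the in-spider deduping raised in Tigelle's comment above, here is a minimal editorial sketch (not from either answer): keep a set of absolute URLs the spider has already yielded and skip repeats. The DuplicatesPipeline answer above is generally the more idiomatic place for this, since it keeps dedupe logic out of the parsing callback.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.item import Item, Field


class MyItem(Item):
    url = Field()


class MySpider(CrawlSpider):
    name = 'relpathfinder'
    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com/']
    rules = (Rule(LinkExtractor(allow=()), callback='url_join', follow=True),)

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.url_seen = set()  # absolute URLs already yielded during this crawl

    def url_join(self, response):
        for link in response.xpath('//img/@src').extract():
            absolute_url = response.urljoin(link)
            if absolute_url in self.url_seen:
                continue  # skip duplicates within the crawl
            self.url_seen.add(absolute_url)
            item = MyItem()
            item['url'] = absolute_url
            yield item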