Question: How do I use Scrapy to create a non-duplicative list of absolute paths from the relative paths found in img src attributes?
Background: I am trying to use Scrapy to crawl a site, pull any links from img src attributes, convert the relative paths to absolute paths, and then output the absolute paths as a CSV or a list. I plan on combining this with actually downloading the files using Scrapy while concurrently crawling for links, but I'll cross that bridge when I get to it. For reference, here are some other details about the hypothetical target site:
- The relative paths look like img src="/images/file1.jpg", where images is a directory (www.example.com/products/images) that cannot be directly crawled for file paths.
- The relative paths for these images do not follow any logical naming convention (i.e., they are not sequentially numbered like file1.jpg, file2.jpg, file3.jpg).
- The image types differ across files, with PNG and JPG being the most common.
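To be concrete about what I mean by converting a relative path to an absolute path: it is just resolving the src value against the URL of the page it was found on, which is what urllib.parse.urljoin (and Scrapy's response.urljoin) does. The product-page URL below is made up for illustration:

```python
from urllib.parse import urljoin

# Hypothetical page on which the img tag was found
page_url = "https://www.example.com/products/page.html"

# A root-relative src resolves against the domain root,
# not against the /products/ directory
absolute = urljoin(page_url, "/images/file1.jpg")
print(absolute)
# https://www.example.com/images/file1.jpg
```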
Problems experienced: Even after thoroughly reading the Scrapy documentation and going through many fairly dated Stack Overflow questions [e.g., this question], I can't get the precise output I want. I can pull the relative paths and reconstruct them, but the output is off. Here are the issues I've noticed with my current code:
In the CSV output, there are both populated rows and blank rows. My best guess is that each row represents the result of scraping a particular page for relative paths, which would mean a blank row comes from a page with no matches.
Each non-blank row in the CSV contains a comma-delimited list of URLs, whereas I would simply like a single, non-duplicative URL per row. The comma-delimited lists seem to support my suspicion about what is going on under the hood.
Current code: I execute the following code in the command line using 'scrapy crawl relpathfinder -o output.csv -t csv'.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.item import Item, Field

class MyItem(Item):
    url = Field()

class MySpider(CrawlSpider):
    name = 'relpathfinder'
    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com/']
    rules = (Rule(LinkExtractor(allow=()), callback='url_join', follow=True),)

    def url_join(self, response):
        item = MyItem()
        item['url'] = []
        relative_url = response.xpath('//img/@src').extract()
        for link in relative_url:
            item['url'].append(response.urljoin(link))
        yield item
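To make the output I'm after concrete, here is a minimal, Scrapy-free sketch of the intended logic: resolve each src against the page URL, skip anything already seen, and produce one URL per row rather than one list per page. The page URL and src values are made up, and absolute_image_urls is just a name I invented for the sketch:

```python
from urllib.parse import urljoin

def absolute_image_urls(page_url, src_values, seen):
    # Resolve each relative src to an absolute URL,
    # skipping any URL already emitted on an earlier page.
    for src in src_values:
        absolute = urljoin(page_url, src)
        if absolute not in seen:
            seen.add(absolute)
            yield absolute

seen = set()  # shared across pages, like a spider attribute would be
urls = list(absolute_image_urls(
    "https://www.example.com/products/page.html",
    ["/images/file1.jpg", "/images/file2.png", "/images/file1.jpg"],
    seen,
))
print(urls)
# ['https://www.example.com/images/file1.jpg', 'https://www.example.com/images/file2.png']
```

If each of those URLs were yielded as its own item, I believe the CSV exporter would write one URL per row, which is the output I want.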
Thank you!