Crawling and Scraping Wiki:Picture of the day

Question

I am trying to work on a pet project that needs me to crawl through a list of Wikipedia: Picture of the day pages by month. As an example: https://en.wikipedia.org/wiki/Wikipedia:Picture_of_the_day/May_2004 has a list of images followed by a brief caption for each image. I want to do the following 2 things here:

1. Scrape all the images from the page along with their respective captions. (Preferably into a dictionary that stores Image: Caption pairs.)
2. Crawl through the other months and repeat step 1.

Any help on how to accomplish this would be highly appreciated.

Thank you very much.

What have you tried so far? SO is not a code-writing service, please show us something that we may help with. — h4z3, Feb 07 '20 at 10:23

score 0 · Accepted Answer · edited Jun 20 '20 at 09:12

I suggest using Scrapy in Python, as it is much lighter than, for example, Selenium. The parse function receives the downloaded HTML of each page; in it you can select all img tags and print the image links, and do the same for the text, since all the text we need is in <p> tags. You can also save the results to a file if needed.

import logging

import scrapy
from scrapy.crawler import CrawlerProcess


class Spider(scrapy.Spider):
    # Declared as class attributes, the usual Scrapy convention,
    # so no __init__ override is needed.
    name = "WikiScraper"
    # Here you can add more links or generate them.
    start_urls = ["https://en.wikipedia.org/wiki/Wikipedia:Picture_of_the_day/May_2004"]

    def parse(self, response):
        # The src attribute of every <img> on the page
        for src in response.css("img::attr(src)").getall():
            print("Image:", src)
        # Text found inside <p> elements
        for text in response.css("p *::text").getall():
            print("Text:", text)


if __name__ == "__main__":
    # Keep Scrapy's own log messages out of the output
    logging.getLogger("scrapy").propagate = False
    process = CrawlerProcess()
    process.crawl(Spider)
    process.start()
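
For the "generate them" part of start_urls, here is a minimal sketch that builds one URL per month. It assumes every monthly archive follows the same Month_Year pattern as the May_2004 link from the question; that pattern is inferred from the single example URL, and the names BASE, MONTHS and month_urls are made up for illustration.

```python
# Assumed URL prefix, copied from the May_2004 example in the question.
BASE = "https://en.wikipedia.org/wiki/Wikipedia:Picture_of_the_day/"

# Spelled out explicitly so the result does not depend on the locale.
MONTHS = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)


def month_urls(first, last):
    """Yield one archive URL per month from first to last, inclusive.

    first and last are (year, month) tuples, for example (2004, 5).
    """
    year, month = first
    while (year, month) <= last:
        yield f"{BASE}{MONTHS[month - 1]}_{year}"
        # Roll over to January of the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
```

The spider could then use something like start_urls = list(month_urls((2004, 5), (2004, 12))). It is worth opening a few of the generated URLs in a browser first to confirm the pattern holds for the months you need.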

Lastly, you have to join the text fragments that belong together (I didn't have time to do it) and add all the pages you need. Anything I didn't mention you can find in the Scrapy documentation.
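
One possible way to do the joining and get an Image: Caption dictionary is sketched below. It uses only the standard library's html.parser and assumes that each picture and its caption paragraphs sit together inside the same top-level <table>. That layout is an assumption, not something checked against the live page, so inspect the HTML first and adjust the tag names. ImageCaptionParser, image_captions and the SAMPLE markup are all hypothetical.

```python
from html.parser import HTMLParser


class ImageCaptionParser(HTMLParser):
    """Pair the first <img> in each top-level <table> with the <p> text in it."""

    def __init__(self):
        super().__init__()
        self.pairs = {}         # image src -> caption text
        self._table_depth = 0   # > 0 while inside a table
        self._p_depth = 0       # > 0 while inside a <p> within a table
        self._src = None        # first image seen in the current table
        self._text = []         # caption fragments for the current table

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table_depth += 1
        elif tag == "img" and self._table_depth and self._src is None:
            self._src = dict(attrs).get("src")
        elif tag == "p" and self._table_depth:
            self._p_depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._p_depth:
            self._p_depth -= 1
        elif tag == "table" and self._table_depth:
            self._table_depth -= 1
            if self._table_depth == 0:
                if self._src:
                    # Join the fragments and collapse runs of whitespace.
                    caption = " ".join("".join(self._text).split())
                    self.pairs[self._src] = caption
                self._src, self._text, self._p_depth = None, [], 0

    def handle_data(self, data):
        if self._p_depth:
            self._text.append(data)


def image_captions(html):
    """Return a dict mapping image src to caption for one page of HTML."""
    parser = ImageCaptionParser()
    parser.feed(html)
    parser.close()
    return parser.pairs


# Made-up markup that mimics the assumed layout.
SAMPLE = """
<table><tr>
  <td><img src="//example.org/a.jpg"></td>
  <td><p>A <b>red</b> flower.</p></td>
</tr></table>
<table><tr>
  <td><img src="//example.org/b.jpg"></td>
  <td><p>A bridge
  at dusk.</p></td>
</tr></table>
"""

print(image_captions(SAMPLE))
```

Inside parse, calling image_captions(response.text) would give the mapping for that page. If the real pages group images and captions differently, only the "table" and "p" checks should need to change.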

Hope I didn't miss anything!

Thanks for your quick help! But it seems like I have run into a small problem executing the code as it throws: ReactorNotRestartable Traceback (most recent call last) in () 20 process = CrawlerProcess() 21 process.crawl(Spider) ---> 22 process.start() — Soumya Ranjan Sahoo, Feb 07 '20 at 11:22
I'm not sure how to help with that, though; you'll have to test a few things. Probably start with [that](https://stackoverflow.com/questions/41495052/scrapy-reactor-not-restartable) — Seba1583, Feb 07 '20 at 11:59
Hi, is it possible to get the image and its corresponding text field within as a Python dict? The current selector selects the unwanted fields. I want to generate an image-caption mapping. Thanks! — Soumya Ranjan Sahoo, Apr 05 '20 at 12:30
