
I'm trying to scrape listing information from Craigslist, but unfortunately I can't seem to get the images since they are in a slideshow.

import requests
from bs4 import BeautifulSoup as soup

url = "https://newyork.craigslist.org/search/sss"
r = requests.get(url)
souped = soup(r.content, 'lxml')

Since the images aren't even in the requested HTML file, do I need to somehow dynamically load the page? If so, can I keep it in pure Python? I don't want any other dependencies. Thanks in advance; I'm pretty new to this, so any help would be appreciated.

Da Jankee
  • As you can see you have the links to the images, I suggest you extract the URLs and then use `requests` to download the image using those URLs. See [this post](https://stackoverflow.com/questions/13137817/how-to-download-image-using-requests) for downloading images with that module – Plopp Feb 06 '19 at 12:55
  • Thanks but I'm not looking to download the images just want the links. I have a loop that gets the title, location, price, etc of listings to a CSV file, I just want it to add the link(s) of the images to it as well. And sorry I'm a noob at python so a simple solution would be helpful. – Da Jankee Feb 06 '19 at 13:09

1 Answer


Look for the `a` tags with the classes `result-image gallery`. Each of those tags has a `data-ids` attribute which holds part of the names of the image files.

<a href="https://newyork.craigslist.org/mnh/fuo/d/new-york-city-3-piece-shaped-ikea-couch/6812749499.html" class="result-image gallery" data-ids="1:00707_iRUU5VKwkWi,1:00H0H_6AIBqK2iQDU">
           ....
</a>
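As a minimal sketch, the `data-ids` attribute can be extracted and split with the same `requests`/`BeautifulSoup` setup you already have. The HTML string below is a stand-in for the live page (assumed structure taken from the snippet above, not live data), and the stdlib `html.parser` is used so the example has no `lxml` dependency:

```python
from bs4 import BeautifulSoup

# Stand-in for the HTML returned by requests.get(url).content
html = '''
<a href="https://newyork.craigslist.org/mnh/fuo/d/new-york-city-3-piece-shaped-ikea-couch/6812749499.html"
   class="result-image gallery"
   data-ids="1:00707_iRUU5VKwkWi,1:00H0H_6AIBqK2iQDU"></a>
'''

souped = BeautifulSoup(html, 'html.parser')

for tag in souped.select('a.result-image.gallery'):
    # Each entry looks like "1:00707_iRUU5VKwkWi"; the part after
    # the colon is the partial image name.
    partial_names = [part.split(':', 1)[1]
                     for part in tag['data-ids'].split(',')]
    print(partial_names)
```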

Now, if you want to get the URLs, first get that attribute and parse out the partial image names (in that example, 00707_iRUU5VKwkWi and 00H0H_6AIBqK2iQDU).

And now you can build the URLs from the host, the size suffix (`_300x300`) and the extension:

https://images.craigslist.org/00707_iRUU5VKwkWi_300x300.jpg
https://images.craigslist.org/00H0H_6AIBqK2iQDU_300x300.jpg
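A hedged sketch of that last step: the `images.craigslist.org` host and the `_300x300` suffix are taken from the examples above and may change on Craigslist's side, so treat them as assumptions rather than a stable API.

```python
# Partial names parsed out of the data-ids attribute
partial_names = ['00707_iRUU5VKwkWi', '00H0H_6AIBqK2iQDU']

# Build the full thumbnail URLs: host + partial name + size suffix + extension
image_urls = [f'https://images.craigslist.org/{name}_300x300.jpg'
              for name in partial_names]

print(image_urls)
```

These links can then be appended to the same row your loop already writes to the CSV, alongside the title, location, and price.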
– arieljuod