
I am using BeautifulSoup to extract pictures, which works well for normal pages. Now I want to extract the picture of the Chromebook from a web page like this:

https://twitter.com/banprada/statuses/829102430017187841

The page apparently contains a link to another page with the image. Here is my code for downloading an image from the mentioned link, but I am only getting the image of the person who posted the link.

import urllib.request
import os
from bs4 import BeautifulSoup

URL = "http://twitter.com/banprada/statuses/829102430017187841"
list_dir="D:\\"
default_dir = os.path.join(list_dir,"Pictures_neu")
opener = urllib.request.build_opener()
urllib.request.install_opener(opener)
soup = BeautifulSoup(urllib.request.urlopen(URL).read(), "html.parser")  # name the parser explicitly
imgs = soup.find_all("img", {"alt": True, "src": True})
for img in imgs:
   img_url = img["src"]
   filename = os.path.join(default_dir, img_url.split("/")[-1])
   img_data = opener.open(img_url)
   f = open(filename,"wb")
   f.write(img_data.read())
   f.close()

Is there a way to download the image somehow?

Many thanks and regards, Andi

Andi Maier
  • the page has JS that is not rendered when you are fetching the webpage with urllib – Alex Fung Feb 09 '17 at 09:50
  • try using a JS renderer lib like dryscrape, mentioned [here](http://stackoverflow.com/questions/8049520/web-scraping-javascript-page-with-python) – Alex Fung Feb 09 '17 at 09:56
  • The required image is located inside an `iframe` that is not present in the initial page source. Would a solution in `Python` + `selenium` be acceptable for you? – Andersson Feb 09 '17 at 10:13
  • Thx for the hints. Python + selenium could be a solution (would be great to have a workable solution) – Andi Maier Feb 09 '17 at 14:55
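
To illustrate the point made in the comments above, here is a minimal sketch of why a plain fetch only finds the avatar. It parses a small, made-up HTML string (`STATIC_HTML`, standing in for what `urllib` would return; the `example.com` URLs and the `TagCollector` class are hypothetical) with the standard-library `html.parser`. The iframe is only created by the script, which a plain fetch never executes, so no `<iframe>` tag and no image inside it show up in the parsed source.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect the src attributes of <img> and <iframe> tags in static HTML."""

    def __init__(self):
        super().__init__()
        self.img_srcs = []
        self.iframe_srcs = []

    def handle_starttag(self, tag, attrs):
        src = dict(attrs).get("src")
        if tag == "img" and src:
            self.img_srcs.append(src)
        elif tag == "iframe":
            self.iframe_srcs.append(src)


# Hypothetical stand-in for the page source a plain HTTP fetch returns.
STATIC_HTML = """
<html><body>
  <img class="avatar" alt="" src="https://example.com/avatar.jpg">
  <div id="card"></div>
  <script>
    // A browser would run this and inject the iframe that holds the image.
    var f = document.createElement('iframe');
    f.src = 'https://example.com/card';
    document.getElementById('card').appendChild(f);
  </script>
</body></html>
"""

collector = TagCollector()
collector.feed(STATIC_HTML)
print(collector.img_srcs)     # only the avatar is present in the static source
print(collector.iframe_srcs)  # empty: the iframe exists only after the script runs
```

A tool that actually executes the page's JavaScript, such as a browser driven by Selenium, is what makes the iframe and its image reachable.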

1 Answer


This is how you can get only the mentioned image using Selenium + requests:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import requests

link = 'https://twitter.com/banprada/statuses/829102430017187841'
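driver = webdriver.PhantomJS()  # PhantomJS support was removed in Selenium 4; use e.g. webdriver.Chrome() there
driver.get(link)
# Wait up to 10 seconds for the iframe to become available, then switch into it
wait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, "//iframe[starts-with(@id, 'xdm_default')]")))
image_src = driver.find_element(By.TAG_NAME, 'img').get_attribute('src')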
response = requests.get(image_src).content
with open('C:\\Users\\You\\Desktop\\Image.jpeg', 'wb') as f:
    f.write(response)

If you want to get all the images from all iframes on the page (excluding the images in the initial page source, which you can already get with your code):

from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By
import requests
import time

link = 'https://twitter.com/banprada/statuses/829102430017187841'
driver = webdriver.Chrome()
driver.get(link)
time.sleep(5)  # Wait until all iframes are completely rendered. Increase if needed
iframe_counter = 0
while True:
    try:
        # Switch to the iframe by index; this raises once the index runs past the last iframe
        driver.switch_to.frame(iframe_counter)
        pictures = driver.find_elements(By.XPATH, '//img[@src and @alt]')
        for pic_index, pic in enumerate(pictures):
            response = requests.get(pic.get_attribute('src')).content
            # File name is <iframe index>_<image index>.jpeg
            with open('C:\\Users\\You\\Desktop\\Images\\%d_%d.jpeg' % (iframe_counter, pic_index), 'wb') as f:
                f.write(response)
        driver.switch_to.default_content()
        iframe_counter += 1
    except WebDriverException:
        break

Note that you can use any webdriver.
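
The code above saves every file with a fixed `.jpeg` extension. As an optional refinement, the sketch below derives the extension from the image URL instead. The helper `image_filename` and the `example.com` URLs are hypothetical, not part of Selenium or requests, and the helper assumes the URL path ends in the real file extension, which is not guaranteed for every site.

```python
import posixpath
from urllib.parse import urlsplit


def image_filename(frame_index, image_index, src, default_ext=".jpeg"):
    """Build a name like '2_0.png' from the iframe index, image index and URL."""
    path = urlsplit(src).path            # drops the query string and fragment
    ext = posixpath.splitext(path)[1]    # e.g. '.png', or '' if there is none
    ext = ext.split(":")[0]              # drop a size suffix such as ':large'
    return "%d_%d%s" % (frame_index, image_index, ext or default_ext)


print(image_filename(0, 3, "https://example.com/media/photo.png?x=1"))
print(image_filename(1, 1, "https://example.com/a/b.jpg:large"))
print(image_filename(2, 0, "https://example.com/card"))
```

The returned name can be joined to the target folder with `os.path.join` before opening the file for writing.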

Andersson