
I want to extract the original photos that Google shows on a search results page like this one:

https://www.google.com/search?biw=1046&bih=720&tbm=shop&ei=sznVWvq5OcbrzgKPgKLoDA&q=red+dress&oq=red+dress&gs_l=psy-ab.3..0l10.1256.2298.0.2485.9.7.0.0.0.0.238.408.0j1j1.2.0....0...1c.1.64.psy-ab..7.2.407....0.WHO8-4Nhfj0

After inspecting the page in the browser, I saw that the original photos are attached to the name _image_src, but I'm not sure how to grab them with BeautifulSoup.

For example, one of the images is:

_image_src='data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wCEAAkGBwgHBgkIBwgKCgkLDRYPDQwMDRsUFRAWIB0iIiAdHx8kKDQsJCYxJx8fLT0tMTU3Ojo6Iys/RD84QzQ5OjcBCgoKDQwNGg8PGjclHyU3Nzc3Nzc3Nzc3Nzc3Nzc3Nzc3Nzc3Nzc3Nzc3Nzc3Nzc3Nzc3Nzc3Nzc3Nzc3Nzc3N//AABEIAYYBZgMBIgACEQEDEQH/xAAcAAEBAAIDAQEAAAAAAAAAAAAAAQIFAwQGBwj/xABFEAACAQMCAgcFAwoEBQUBAAAAAQIDBBEFEiExBhNBUWFxgSIykaGxFEJSBxUjJHKSwdHh8DNigqJDc7LC8SU0NURjFv/EABoBAQADAQEBAAAAAAAAAAAAAAABAwQCBQb/xAAvEQEAAgIBAwMDAwMEAwAAAAAAAQIDESEEEjEFIjITQVEUYbFxkdEVM6HwI0KB/9oADAMBAAIRAxEAPwD6KigEoEigoAFwEBSIoAApcAYopkQBgFGAIkUuBgCAowBAUoEBcDAEGClAgwC4JAYBQIChgcNzXp21CVarLEYrifOtU6dandXNWOkUrX7HSeHOFxCrOWG/a4SW1eGM/RdT8oOvz1SpOytJv7FTkotx49bLux29vDuWX4d3ot0Bp1aFOvqqkpPElST4rzZRbJ+F9Mf5aqHTHX6VRRV7KcZNZp1YR9nualtT/ez2evNV6e6pa3FJXFSjUUlnq2trfe8x5c0uWOHI9rcdD9JhDb9neF2bmeV6RdFoUaUp2EJOlHLlRbzw8MpnEZdSs+lExw9x0e1q31zTqd3at8eE4Sxug+5m0PjfQPUVpWq1ouUqdOaxNdsfHHas88dj4ccY+xUKsK9GnVptOE4qUWnnKayX0ttnvXtlQZMh04YtFKAMWQyIBMEwZYIBMdwKQDiKUYAAuAALgJGQEwCjAEwUowBC4LgATAKAABcAQFAESKCgQFAE8y4BQMSlwCRCjBUgIkarpRdO20a4UJYqVIqEX3Z5+XDc/Rm2PJdPK8IWMadR8K7lSjxxxktsvXY5erObzqrqkbtEPN9ErKnXvqcpxyoQ+0PK45m3s+Sy13qJ9JtMpZWOR8v6E6pO8t9YvKUown18du+PCEdqaylzxx4ZMqnSmtK4jCnrd9NSTa20ae3anhyyo4xlNGON7btbh9Pqt1E1lepqbqdKNXqXVp9Y+VPetz9DW9Ialf8AMdKpUVd9avbVJPfw49nHs5I8ppO2yu+pWjOFNVlCoq1GM3Uy3mWXndHlx8RPJEah0emFstE163vYpqjXjmePhL4r547z6H0Qv41bKnSW3ZlxW3lGXP4SWWn3p955L8qtGVex077OlulUaipSwsvbzb5Jd79Tp/k2vbinc3Ok1pQdeNNdW4zUoScfbpyTXN.......

I tried:

from random import randint
import time

from bs4 import BeautifulSoup
import requests

# Reference for scraping Google search: https://stackoverflow.com/questions/39354587/scraping-google-news-with-beautifulsoup-returns-empty-results
query = "red dress"
time.sleep(randint(0, 2))  # pause briefly so Google doesn't get angry
# Let requests build and URL-encode the query string
r = requests.get("https://www.google.com/search", params={"q": query, "tbm": "shop"})

# Reference for finding image tags: https://www.youtube.com/watch?v=tmgfCJv7dW0
soup = BeautifulSoup(r.content, "html.parser")
print(soup.prettify())

print(soup.find_all('_image_src'))

I noticed, though, that when I print the soup it doesn't show everything I see when inspecting the page; in particular, it doesn't print any _image_src values. Why isn't this giving me everything on that page?
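
One thing worth noting about the last line of the attempt: `find_all` with a plain string matches tag names, so `soup.find_all('_image_src')` looks for `<_image_src>` elements and cannot match text of the form `_image_src='data:...'`. Below is a minimal sketch of searching the raw page text instead. It assumes the string appears as a single-quoted assignment, as in the example above; `sample_html` and `find_inline_images` are hypothetical names used only for illustration, and whether the response returned to `requests` contains these strings at all is not confirmed here.

```python
import re

# Hypothetical stand-in for r.text; the real page layout is an assumption.
sample_html = """
<html><body>
<script>var _image_src='data:image/jpeg;base64,/9j/4AAQSkZJRg==';</script>
<img src="https://example.com/a.png">
</body></html>
"""

# Matches _image_src='data:image/<type>;base64,<payload>' and captures the data URI.
DATA_URI = re.compile(r"_image_src\s*=\s*'(data:image/[\w.+-]+;base64,[^']+)'")


def find_inline_images(html_text):
    """Return every data URI assigned to _image_src in the raw page text."""
    return DATA_URI.findall(html_text)


uris = find_inline_images(sample_html)
print(len(uris), "inline image(s) found")
```

With the real response, `find_inline_images(r.text)` would be the equivalent call; an empty list would suggest the served HTML differs from what the browser inspector shows.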

colidyre
Bob
  • It says on the tin: `data:image/jpeg;base64`, so you take the value, base64-decode it, and store as a jpeg file. But this is likely a small thumbnail. – 9000 Apr 17 '18 at 19:14
  • on the tin? what does that mean? – Bob Apr 17 '18 at 19:18
  • It's [a figurative expression](https://en.wikipedia.org/wiki/Does_exactly_what_it_says_on_the_tin). It means that the description of the contents is present right before our eyes on the packaging. – 9000 Apr 17 '18 at 19:48
  • oh ok, well it didn't work – Bob Apr 17 '18 at 19:49
  • Please take another look at [a very similar question / answer](https://stackoverflow.com/a/19395899/223424). – 9000 Apr 17 '18 at 19:51
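
The decoding step suggested in the first comment can be sketched as follows. This is a minimal illustration, assuming the value has the form `data:<mime type>;base64,<payload>`; `decode_data_uri` and `save_data_uri` are hypothetical helper names, and the sample payload is made-up bytes rather than a real image.

```python
import base64


def decode_data_uri(data_uri):
    """Split a base64 data URI into (mime_type, raw_bytes)."""
    header, _, payload = data_uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URI")
    mime_type = header[len("data:"):-len(";base64")]
    return mime_type, base64.b64decode(payload)


def save_data_uri(data_uri, path):
    """Decode the data URI and write the bytes to path; return the MIME type."""
    mime_type, raw = decode_data_uri(data_uri)
    with open(path, "wb") as fh:
        fh.write(raw)
    return mime_type


# Made-up payload used only to exercise the helpers (not a valid JPEG).
sample = "data:image/jpeg;base64," + base64.b64encode(b"\xff\xd8\xff\xe0demo").decode("ascii")
mime_type, raw = decode_data_uri(sample)
print(mime_type, len(raw))
```

Calling `save_data_uri(value, "thumb.jpg")` on a full, untruncated `_image_src` value would write the image to disk. As the comment notes, the result is likely a small thumbnail rather than the original photo.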
