
Before I start asking questions, I apologize in advance: I'm a Korean high school student, so my questions may be hard to read.

I want my code to print the src of each image, but it prints None once i goes past 22, so I can't download as many images as I want.

It prints output like this. These are the image srcs when I use the keyword 'cat'.

20 https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQdMIU_4V4XtUAiV2uOBmeixkhQuy6N3eaHH1XuUzOYFyQZBZefEg

21 https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQvmdG435HxyF0e1DP1IBVos10zTwuNJ0p9M_iYDzlYWup6AgfV6w

22 https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQL8NCMT9h7p8koWq3pgyhS8EebE9qh24e-5SQWzIpmDgBNvNaO

23 None

24 None

25 None

26 None

I searched Google for about an hour but couldn't find the cause of this bug, which is why I'm asking a question on Stack Overflow for the first time.

I have omitted the function named make_dir.

import os
import shutil
import urllib.request
import time

from selenium import webdriver

def crawl(keyword, max_count):
    cnt = 0

    url = "https://www.google.co.in/search?q=" + keyword + "&tbm=isch"  # google search url with search word

    browser = webdriver.Chrome("C:\\Users\\Master\\Desktop\\crawling\\chromedriver.exe")  # webdriver
    browser.get(url)  # open web page

    img_list = browser.find_elements_by_class_name("rg_ic")  # find image


    for i, el in enumerate(img_list):
        if cnt >= max_count:
            break

        img = img_list[i]
        src = img.get_attribute('src')
        if src is None:
            print(i, src)  # img_list includes None so I need to fix it
            continue

        cnt += 1
        print(i, src)  # print src
        urllib.request.urlretrieve(src, str(cnt) + ".png")  # download image

    browser.quit()

if __name__ == "__main__":
    max_count = int(input("Number of crawls : "))
    keyword = input("Search word : ")

    make_dir()
    crawl(keyword, max_count)

I wrote the code to print each src. It prints a valid src while i is 22 or below, but beyond that it only prints None; I want it to print the correct src for those entries too.


김영석
  • I haven't tested this personally but you're facing this problem probably because google uses lazy loading to load images which means that only images in the viewport are loaded. You might need to scroll to load those images. See answer at https://stackoverflow.com/questions/20986631/how-can-i-scroll-a-web-page-using-selenium-webdriver-in-python – Aditya Agrawal Sep 21 '19 at 13:17
  • I'm so glad you're trying to help me, but I think the problem you pointed out is irrelevant to this case. I tried what you suggested since it could work, but the list still includes None – 김영석 Sep 21 '19 at 14:08
  • Hey, check my answer and let me know if it works – Aditya Agrawal Sep 21 '19 at 14:26

1 Answer


Try this as your crawl function. Google uses lazy loading, which means that until an image enters the viewport its link is stored in the data-src attribute rather than src. I haven't tested the snippet extensively, but it should work.

import base64  # needed to decode inline data-URI images

def crawl(keyword, max_count):
    cnt = 0

    url = "https://www.google.co.in/search?q=" + keyword + "&tbm=isch"  # google search url with search word

    browser = webdriver.Chrome("C:\\Users\\Master\\Desktop\\crawling\\chromedriver.exe")  # webdriver
    browser.get(url)  # open web page

    img_list = browser.find_elements_by_class_name("rg_ic")  # find images

    for i, img in enumerate(img_list):
        if cnt >= max_count:
            break

        src = img.get_attribute('src')
        if src is None:
            # lazy-loaded images keep their URL in data-src until scrolled into view
            src = img.get_attribute('data-src')
            if src is None:
                continue

        cnt += 1
        print(i, src)  # print src
        if src[0] == 'h':
            # src is a normal http(s) URL: download it directly
            urllib.request.urlretrieve(src, str(cnt) + ".png")
        else:
            # src is a data URI like "data:image/jpeg;base64,<payload>";
            # the base64 payload is everything after the first comma
            with open(str(cnt) + ".png", "wb") as fh:
                fh.write(base64.b64decode(src.split(',', 1)[1]))

    browser.quit()

The code uses some ugly hacks like if src[0]=='h' and is only there for illustration purposes.
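A cleaner way to separate the two cases is to branch on the data: scheme instead of the first character. This is a small sketch of that idea; the save_image helper name is my own, not part of the original answer:

```python
import base64
import urllib.request

def save_image(src, path):
    """Save an image given either an http(s) URL or a base64 data URI."""
    if src.startswith("data:"):
        # A data URI looks like "data:image/jpeg;base64,<payload>";
        # everything after the first comma is the base64 payload.
        payload = src.split(",", 1)[1]
        with open(path, "wb") as fh:
            fh.write(base64.b64decode(payload))
    else:
        # Plain URL: let urllib fetch it.
        urllib.request.urlretrieve(src, path)
```

With a helper like this the loop body collapses to save_image(src, str(cnt) + ".png"), and no hard-coded slice offsets are needed.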

Aditya Agrawal
  • Thank you very much for helping me this much, but when I use your code, an error occurs – 김영석 Sep 21 '19 at 14:57
  • Glad to help. Can you tell me what error you encountered? I checked the snippet on my machine and it correctly retrieves all the URLs – Aditya Agrawal Sep 21 '19 at 14:58
  • Traceback (most recent call last): File "C:/Users/master/Desktop/crawling/crawling.py", line 58: crawl(keyword, max_count); File "C:/Users/master/Desktop/crawling/crawling.py", line 33, in crawl: urllib.request.urlretrieve(src, str(cnt) + ".png"); File "C:\Users\master\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 245, in urlretrieve: url_type, path = splittype(url); File "C:\Users\master\AppData\Local\Programs\Python\Python37\lib\urllib\parse.py", line 973, in splittype: match = _typeprog.match(url); TypeError: expected string or bytes-like object – 김영석 Sep 21 '19 at 14:58
  • This is because Google encodes visible images as base64 causing the image source to be the image itself and not a valid URL. You can use a simple conditional to check if "src" is a valid URL, if not treat it as a base64 encoded image and write it to the file – Aditya Agrawal Sep 21 '19 at 15:26
  • Wow, that's interesting! I would never have found the problem if I hadn't asked here! So how do I handle the srcs that are base64 encoded? – 김영석 Sep 21 '19 at 15:32
  • I updated my answer to include base64 images, try now. I tested it on my machine and it works – Aditya Agrawal Sep 21 '19 at 15:34
  • Wow! It totally works! I really love you! I had been trying this since last week and now you've solved it. Thank you very much!!! Have a nice day! – 김영석 Sep 21 '19 at 15:40