For some reason, when I use the Python library 'requests' to send a GET request for a website's HTML, it doesn't return the full HTML.

What is happening?

import re
import requests

url = 'https://www.aliexpress.com/item/Dragon-Ball-Z-Mug-SON-Goku-Mug-Hot-Changing-Color-Cups-Heat-Reactive-Mugs-and-Cups/32649664569.html'

mess = requests.get(url)

print(mess.text, '\n', '_'*20)

approved = []
images = re.findall(r'(?<=src=")[a-zA-Z0-9 \/\\,._-]+(?=")', mess.text)

for image in images:
    print(image)
    base, ext = image.rsplit('.', 1)

    if ext == 'png' or ext == 'jpg' or ext == 'JPG':
        approved.append(image)

Output:


//u.alicdn.com/js/aplus_ae.js
//i.alicdn.com/ae-header/20170208145626/buyer/front/ae-header.js

This picture shows that there is an 'img' tag with the attribute 'src' which is a jpg. But for some reason, it's not captured in the output.

  • [Don't use regex to parse HTML](http://stackoverflow.com/a/1732454/2482744). Use BeautifulSoup. – Alex Hall Mar 26 '17 at 14:26
  • Many (most?) modern websites include dynamic content that is generated on-the-fly via Javascript. This content will not be available in the response to a `GET` request. It's possible you are encountering this situation. – larsks Mar 26 '17 at 14:28
  • What should I do @larsks to solve it? –  Mar 26 '17 at 14:30
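One thing worth checking before blaming dynamic content: the character class in the question's regex has no `:`, so any absolute `src` such as `https://...` can never match, while protocol-relative `//...` values do, which is consistent with the output shown. Below is a minimal sketch of collecting `img` sources with the standard library's `html.parser` instead of a regex. The sample HTML, the `example.com` URLs, and the names `ImgSrcCollector` and `image_sources` are made up for illustration and are not from the original page.

```python
import posixpath
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical stand-in for mess.text; not the real AliExpress markup.
SAMPLE_HTML = '''
<html><body>
<script src="//u.example.com/js/app.js"></script>
<img src="https://img.example.com/mug.jpg_640x640.jpg">
<img src="/static/logo.PNG">
<img alt="no source">
</body></html>
'''

# The regex from the question, unchanged.
QUESTION_PATTERN = r'(?<=src=")[a-zA-Z0-9 \/\\,._-]+(?=")'


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:  # skip <img> tags without a src attribute
                self.sources.append(src)


def image_sources(html_text, extensions=("png", "jpg", "jpeg")):
    """Return img src values whose path ends in one of the given extensions."""
    collector = ImgSrcCollector()
    collector.feed(html_text)
    approved = []
    for src in collector.sources:
        path = urlparse(src).path
        ext = posixpath.splitext(path)[1].lstrip(".").lower()
        if ext in extensions:
            approved.append(src)
    return approved


# The regex skips the https:// image because ':' is not in its character class.
regex_matches = re.findall(QUESTION_PATTERN, SAMPLE_HTML)
print(regex_matches)

# The parser-based version picks up both images and ignores the script tag.
print(image_sources(SAMPLE_HTML))
```

This only sees what is present in the downloaded HTML. If the page builds its images with JavaScript after loading, as larsks suggests may be the case, no parser applied to the raw response will find them.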

1 Answer

To extract elements from HTML content, use a dedicated parsing module such as lxml or BeautifulSoup.

You can do this with lxml, which is typically much faster than BeautifulSoup, with something like this:

from lxml import html
import requests

url = 'https://www.aliexpress.com/item/Dragon-Ball-Z-Mug-SON-Goku-Mug-Hot-Changing-Color-Cups-Heat-Reactive-Mugs-and-Cups/32649664569.html'

mess = requests.get(url).content

root = html.fromstring(mess)
print(root.xpath('//a[@class="ui-image-viewer-thumb-frame"]/img/@src'))

This results in:

['https://ae01.alicdn.com/kf/HTB16NR_MpXXXXa5XpXXq6xXFXXX0/Dragon-Ball-Z-Mug-SON-Goku-Mug-Hot-Changing-Color-Cups-Heat-Reactive-Mugs-and-Cups.jpg_640x640.jpg']

You can refer to the lxml documentation for more details.

Satish Prakash Garg
  • 1. What does "html.fromstring()" do? 2. Instead of using XPath to search for the specific img, is it possible to search for all img tags? –  Mar 26 '17 at 14:35
  • Yes, `root.xpath('//img/@src')`. I would recommend reading the documentation. – Satish Prakash Garg Mar 26 '17 at 14:37
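To make the two questions above concrete, here is a small offline sketch. `html.fromstring()` parses a string or bytes of HTML into an element tree and returns its root element, which you can then query with XPath. The sample markup and `example.com` URLs are invented for illustration; only the class name `ui-image-viewer-thumb-frame` is taken from the answer.

```python
from lxml import html

# Hypothetical markup standing in for requests.get(url).content.
SAMPLE_HTML = b"""
<html><body>
  <a class="ui-image-viewer-thumb-frame">
    <img src="https://img.example.com/mug.jpg_640x640.jpg">
  </a>
  <img src="/static/logo.png">
</body></html>
"""

# Parse the bytes into a tree; root is the <html> element.
root = html.fromstring(SAMPLE_HTML)

# The specific query from the answer: img directly inside the thumbnail link.
thumb_sources = root.xpath('//a[@class="ui-image-viewer-thumb-frame"]/img/@src')

# The broader query from the comment: the src of every img in the document.
all_sources = root.xpath('//img/@src')

print(root.tag)
print(thumb_sources)
print(all_sources)
```

Each XPath call returns a plain list of strings, so the extension check from the question can be applied to `all_sources` directly.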