xpath lxml cannot get all elements inside the ul tag of html

Question

I have a problem with lxml XPath. The example code below is meant to get the data-asin attribute of every li inside the ul, using this XPath:

"//*[@id="s-results-list-atf"]/li/@data-asin"

Strangely, I only get 6 li elements, while the page has 46.

Could someone please show me where my mistake is?

P.S. I am using Python 2.7.

from lxml import html
import random
import requests
from time import sleep

def getAsin():
    headers_list = [
        {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2211.90 Safari/537.36'},
        {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2111.90 Safari/537.36'},
        {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.3211.90 Safari/537.36'},
        {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2221.90 Safari/537.36'},
        {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2212.90 Safari/537.36'},
        {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2213.90 Safari/537.36'},
        {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2214.90 Safari/537.36'},
        {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2215.90 Safari/537.36'},
        {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2216.90 Safari/537.36'},
        {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2217.90 Safari/537.36'},
        {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2218.90 Safari/537.36'},
        {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2219.90 Safari/537.36'},
        {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2231.90 Safari/537.36'},
        {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2241.90 Safari/537.36'},
    ]
    url = 'https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=t-shirts&rh=i%3Aaps%2Ck%3At-shirts'
    while True:
        sleep(3)
        try:
            # Pick a User-Agent and fetch inside the loop so that a retry really re-requests the page.
            headers = random.choice(headers_list)
            page = requests.get(url, headers=headers)
            # Check the status before parsing; a non-200 response is treated as a captcha page.
            if page.status_code != 200:
                raise ValueError('captcha')
            doc = html.fromstring(page.content)
            XPATH_NAME = '//*[@id="s-results-list-atf"]/li/@data-asin'
            RAW_NAME = doc.xpath(XPATH_NAME)
            print 'aaaaaaaaa', RAW_NAME
            return RAW_NAME
        except Exception as e:
            print e
if __name__ == "__main__":
    getAsin()

I checked the HTML returned by the request; it differs from the HTML shown in Chrome. I got it working ;) Thanks – GEV Entertainment, Jan 18 '19 at 08:45

score 0 · Accepted Answer · answered Jan 17 '19 at 14:24

It seems that not all list items appear in the list "#s-results-list-atf", so an XPath anchored to that ul misses the rest.

Try using

doc.xpath('//li[starts-with(@id, "result_")]/@data-asin')

to get the complete list (60 items).
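To illustrate the difference between the two expressions, here is a minimal sketch on made-up markup. The HTML below is hypothetical and only mimics the situation described (result items split across more than one container); it is not Amazon's real page, and the id "some-other-list" is an assumption for the example.

```python
from __future__ import print_function

from lxml import html

# Hypothetical markup: result items are split across two lists.
SAMPLE = """
<div>
  <ul id="s-results-list-atf">
    <li id="result_0" data-asin="A0"></li>
    <li id="result_1" data-asin="A1"></li>
  </ul>
  <ul id="some-other-list">
    <li id="result_2" data-asin="A2"></li>
    <li id="unrelated" data-asin="X9"></li>
  </ul>
</div>
"""

doc = html.fromstring(SAMPLE)

# Anchored to one <ul>: only its direct <li> children match.
narrow = doc.xpath('//*[@id="s-results-list-atf"]/li/@data-asin')

# Anchored to the <li> id prefix: matches result items in any container.
broad = doc.xpath('//li[starts-with(@id, "result_")]/@data-asin')

print(narrow)
print(broad)
```

The first query stops at the boundary of the one ul, while the second follows the id prefix wherever the items happen to live. On a real page the returned HTML has to be inspected to confirm which ids are actually present.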

Andersson

I checked the HTML returned by the request; it differs from the HTML shown in Chrome. I got it working ;) – GEV Entertainment Jan 18 '19 at 08:44
@GEVEntertainment, this is exactly what I told you in my answer – Andersson Jan 18 '19 at 08:48
:3 Thanks. How do I add a proxy to this request? `page = requests.get(url, headers=headers)` – GEV Entertainment Jan 18 '19 at 08:55
@GEVEntertainment, did you check [this](https://stackoverflow.com/questions/8287628/proxies-with-python-requests-module)? – Andersson Jan 18 '19 at 08:56
Oh, I will check it now ;) – GEV Entertainment Jan 18 '19 at 09:00
Sorry! I tried it like this and got an error: `url = 'https://www.fahasa.com'` `headers ={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2241.90 Safari/537.36'},` `PROXY = {"https": "https//125.27.251.88:59173"}` `headers = random.choice(headers_list)` `page = requests.get(url, proxies=PROXY, headers=headers, timeout=123)` `doc = html.fromstring(page.content)` – GEV Entertainment Jan 18 '19 at 09:06
@GEVEntertainment which error? You should also pass the *scheme*: `url = 'fahasa.com'` -> `url = 'https://fahasa.com'` – Andersson Jan 18 '19 at 09:10
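
As a rough sketch of the proxy setup discussed in these comments: requests takes a `proxies` mapping whose values are full URLs, scheme included. Note that the proxy string in the snippet above reads "https//..." with no colon, and the `headers` line ends in a comma, which makes it a tuple rather than a dict; either could plausibly be involved in the error, though the thread does not say which error occurred. The helper names `make_proxies` and `fetch` and the address 203.0.113.5 are made up for illustration.

```python
import requests


def make_proxies(proxy_url):
    """Build the proxies mapping for requests; reject URLs without a scheme."""
    if not proxy_url.startswith(('http://', 'https://')):
        raise ValueError('proxy URL needs a scheme such as http://: %r' % proxy_url)
    # Route both plain and TLS traffic through the same proxy.
    return {'http': proxy_url, 'https': proxy_url}


def fetch(url, proxy_url, user_agent, timeout=30):
    """Fetch url through the given proxy (hypothetical helper, not called here)."""
    headers = {'User-Agent': user_agent}  # a plain dict, no trailing comma
    return requests.get(url, headers=headers,
                        proxies=make_proxies(proxy_url), timeout=timeout)


# Placeholder proxy address from a documentation range; replace with a real one.
proxies = make_proxies('http://203.0.113.5:8080')
```

Validating the proxy string up front turns a typo such as a missing colon into an immediate, readable error instead of a connection failure later on.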