2

I can't seem to see what is missing. Why is the response not printing the ASINs?

import requests
import re

urls = [
    'https://www.amazon.com/s?k=xbox+game&ref=nb_sb_noss_2',
    'https://www.amazon.com/s?k=ps4+game&ref=nb_sb_noss_2'
]

for url in urls:
    content = requests.get(url).content
    decoded_content = content.decode()

    # Try to pull ASINs out of the /dp/ product links in the raw HTML
    asins = set(re.findall(r'/[^/]+/dp/([^"]+)', decoded_content))
    print(asins)

Output:

set()
set()
[Finished in 0.735s]
mjbaybay7
    There is a [detailed answer](https://stackoverflow.com/a/1732454/2280890) regarding extracting information from HTML with regular expressions. Additionally, you're probably encountering an error regarding automated access to Amazon data. You should check the response status code before processing the content. – import random Dec 15 '20 at 02:33
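
Following the status-code suggestion in the comment above, a minimal sketch (reusing the URLs from the question) might look like this:

import requests

urls = [
    'https://www.amazon.com/s?k=xbox+game&ref=nb_sb_noss_2',
    'https://www.amazon.com/s?k=ps4+game&ref=nb_sb_noss_2'
]

for url in urls:
    response = requests.get(url)
    # A 503 (or a CAPTCHA page served with a 200) means Amazon blocked the
    # request, so the product markup will not be in the body at all.
    print(url, response.status_code)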

1 Answer

2

Regular expressions should not be used to parse HTML. Every Stack Overflow answer to questions like this recommends against using regex for HTML. It is difficult to write a regular expression complex enough to pull the data-asin value out of each <div>. The BeautifulSoup library makes this task much easier. But if you must use regex, this code will return everything inside the body tags:

re.findall(r'<body.*?>(.+?)</body>', decoded_content, flags=re.DOTALL)
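
If regex really must be used, a rough sketch that grabs every data-asin attribute value directly (without trying to scope the match to the result <div> elements, so it remains fragile) could be:

asins = set(re.findall(r'data-asin="([^"]+)"', decoded_content))
print(asins)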

Also, print decoded_content and read the HTML. You might not be receiving the same page that you see in your web browser. Using your code I just get an error message from Amazon or a small CAPTCHA test to check whether I am a robot. If you do not attach realistic headers to your request, big websites like Amazon will not return the page you want; they try to prevent people from scraping their sites.
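
As a quick sanity check before parsing anything, you can look for Amazon's block page in the decoded response; the marker strings below are assumptions and may not match Amazon's current pages:

decoded_content = requests.get(url).content.decode()
# If Amazon served its robot-check / CAPTCHA page instead of the search
# results, there is no product markup for the regex (or BeautifulSoup) to find.
if 'Robot Check' in decoded_content or 'captcha' in decoded_content.lower():
    print('Blocked: Amazon returned a robot check, not the search results')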

Here is some code that works using the BeautifulSoup library. You need to install the library first: pip3 install bs4.

from bs4 import BeautifulSoup
import requests

def getAsins(url):
    # Start from the default requests headers and add a browser User-Agent
    # so Amazon serves the real search page instead of a robot check.
    headers = requests.utils.default_headers()
    headers.update({
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
        'Accept-Language': 'en-US, en;q=0.5'
    })
    decoded_content = requests.get(url, headers=headers).content.decode()
    soup = BeautifulSoup(decoded_content, 'html.parser')
    asins = {}
    # Each search-result <div> carries a data-asin attribute; keep the
    # non-empty ones, keyed by the div's data-uuid.
    for asin in soup.find_all('div'):
        if asin.get('data-asin'):
            asins[asin.get('data-uuid')] = asin.get('data-asin')
    return asins

result = getAsins('https://www.amazon.com/s?k=xbox+game&ref=nb_sb_noss_2')
print(result)

'''
{None: 'B07RBN5C9C', '8652921a-81ee-4e15-b12d-5129c3d35195': 'B07P15JL3T', 'cb25b4bf-efc3-4bc6-ae7f-84f69dcf131b': 'B0886YWLC9', 'bc730e28-2818-472d-bc03-6e9fb97dcaad': 'B089F8R7SQ', '339c4ca0-1d24-4920-be60-54ef6890d542': 'B08GQW447N', '4532f725-f416-4372-8aa0-8751b2b090cc': 'B08DD5559K', 'a0e17b74-7457-4df7-85c9-5eefbfe4025b': 'B08BXHCQKR', '52ef86ef-58ac-492d-ad25-46e7bed0b8b9': 'B087XR383W', '3e79c338-525c-42a4-80da-4f2014ed6cf7': 'B07H5VVV1H', '45007b26-6d8c-4120-9ecc-0116bb5f703f': 'B07DJW4WZC', 'dc061247-2f4c-4f6b-a499-9e2c2e50324b': 'B07YLGXLYQ', '18ff6ba3-37b9-44f8-8f87-23445252ccbd': 'B01FST8A90', '6d9f29a1-9264-40b6-b34e-d4bfa9cb9b37': 'B088MZ4R82', '74569fd0-7938-4375-aade-5191cb84cd47': 'B07SXMV28K', 'd35cb3a0-daea-4c37-89c5-db53837365d4': 'B07DFJJ3FN', 'fc0b73cc-83dd-44d9-b920-d08f07be76eb': 'B07KYC1VL7', 'eaeb69d1-a2f9-4ea4-ac97-1d9a955d706b': 'B076PRWVFG', '0aafbb75-1bac-492c-848e-a046b2de9978': 'B07Q47W1B4', '9e373245-9e8b-4564-a32f-42baa7b51d64': 'B07C4SGGZ2', '4af7587a-98bf-41e0-bde6-2a2fad512d95': 'B07SJ2T3CW', '8635a92e-22a7-4474-a27d-3db75c75e500': 'B08D44W56B', '49d752ce-5d68-4323-be9b-3cbb34c8b562': 'B086JQGB7W', '6398531f-6864-4c7b-9879-84ee9de57d80': 'B07XD3TK36'}
'''

If you are reading the HTML from a file instead:

from bs4 import BeautifulSoup

def getAsins(location_to_file):
    # Parse a saved copy of the search page instead of fetching it live.
    with open(location_to_file) as file:
        soup = BeautifulSoup(file, 'html.parser')
    asins = {}
    for asin in soup.find_all('div'):
        if asin.get('data-asin'):
            asins[asin.get('data-uuid')] = asin.get('data-asin')
    return asins
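
For example, with the search results page saved locally (the file name here is only a placeholder):

result = getAsins('saved_search_page.html')
print(result)
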
Raymond Mutyaba
  • I see what you are saying now. With the headers, I am able to see the message you are referring to. Is there a way to retrieve the ASINs using BeautifulSoup? – mjbaybay7 Dec 15 '20 at 05:18
  • 1
    I added an example that returns a dictionary with the `'data-uuid'` as the keys and `'data-asin'` as the values. You can also just create a list of asins with `myList.append(asin.get('data-asin'))` (see the sketch after these comments). – Raymond Mutyaba Dec 15 '20 at 06:18
  • What is the purpose of the key and the value? Does the key need to come before the value, or does the order not matter? Edit: I am getting empty brackets for the response: { } – mjbaybay7 Dec 15 '20 at 17:30
  • 1
    Dictionaries store entries as key-value pairs. You use the keys to access the values the same way list values are accessed by their position: `dict['key']`, `list[0]`. The only way I get empty brackets with this code is if my headers are incorrect. `print(soup.prettify())` to check whether you are getting the real webpage. – Raymond Mutyaba Dec 15 '20 at 20:10
  • OK, using prettify I see that Amazon is returning an error due to API authentication. To modify this script, I have saved the HTML of the page I'm checking into a file. Using BS4, how can I point this script at the file I have created? I feel that this way Amazon won't require CAPTCHA authentication. – mjbaybay7 Dec 15 '20 at 20:23
  • 1
    `file = open('file.html')` then `soup = BeautifulSoup(file, 'html.parser')` – Raymond Mutyaba Dec 15 '20 at 20:52
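
As a follow-up to the comments above, here is a minimal sketch of the list variant mentioned there (the getAsinList name is just illustrative), collecting only the ASIN values instead of a data-uuid to data-asin dictionary:

from bs4 import BeautifulSoup

def getAsinList(location_to_file):
    # Same parsing as getAsins, but append each data-asin value to a list.
    with open(location_to_file) as file:
        soup = BeautifulSoup(file, 'html.parser')
    asin_list = []
    for div in soup.find_all('div'):
        if div.get('data-asin'):
            asin_list.append(div.get('data-asin'))
    return asin_list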