I am trying to scrape specific text from specific table elements on an Amazon product page.

URL_1 has all elements: https://www.amazon.com/dp/B008Q5LXIE/

URL_2 has only 'Sales Rank': https://www.amazon.com/dp/B001V9X26S

URL_1: The "Product Details" table has 9 items, and I am only interested in 'Product Dimensions', 'Shipping Weight', 'Item Model Number', and all of the seller ranks.

I cannot parse the text out of these items because some of them sit in a single block of HTML while others do not.

I am using BeautifulSoup. Running text.strip() on the whole table returns everything, but the result is very messy. I have tried soup.find('li') with text.strip() to find individual elements, but for the seller rank it returns all 3 ranks jumbled together in one string. I have also tried regex to clean the text, but it does not work for the 4 different seller ranks. I have had success with the try/except/pass pattern for scraping and would like to keep each of these fields in that format.

Here is a bad example of the code I used. I was trying to get the sales rank text that follows the </b> element in the HTML:
#Sales Rank
        sales_rank ='NOT'
        try:
            sr = soup.find('li', attrs={'id':'SalesRank'})
            sales_rank = sr.find('/b').text.strip()
        except:
            pass

I expect to be able to scrape the listed elements into a dictionary. I would like to see the results as:

dimensions = 6x4x4
weight = 4.8 ounces
Item_No = IT-DER0-IQDU
R1_NO = 2,036
R1_CAT = Health & Household
R2_NO = 5
R2_CAT = Joint & Muscle Pain Relief Medications
R3_NO = 3
R3_CAT = Naproxen Sodium
R4_NO = 6
R4_CAT = Migraine Relief

my_dict = {'dimensions': dimensions, 'weight': weight, 'Item_No': Item_No, 'R1_NO': R1_NO, 'R1_CAT': R1_CAT, 'R2_NO': R2_NO, 'R2_CAT': R2_CAT, 'R3_NO': R3_NO, 'R3_CAT': R3_CAT, 'R4_NO': R4_NO, 'R4_CAT': R4_CAT}

URL_2: The only element of interest on the page is 'Sales Rank'; 'Product Dimensions', 'Shipping Weight', and 'Item Model Number' are not present. I would still like a result in the same shape as URL_1, with 'NA' for any element that is missing. I have accomplished this before by setting a default value before the try/except statement, e.g. weight = 'NA', then running try/except: pass, so I get 'NA' and my dictionary is not empty.

workin 4weekend

1 Answer

You could use stripped_strings and the :contains pseudo-class, which needs bs4 4.7.1 or later. This feels like a lot of jiggery-pokery to get the desired output format; I am sure someone with more Python experience could reduce this and improve its efficiency. The dict-merging syntax is taken from @aaronhall.

import requests
from bs4 import BeautifulSoup as bs
import re

links = ['https://www.amazon.com/Professional-Dental-Guard-Remoldable-Customizable/dp/B07L4YHBQ4', 'https://www.amazon.com/dp/B0040ODFK4/?tag=stackoverfl08-20']

for link in links:

    r = requests.get(link, headers = {'User-Agent': 'Mozilla/5.0'})
    soup = bs(r.content, 'lxml')
    fields = ['Product Dimensions', 'Shipping Weight', 'Item model number', 'Amazon Best Sellers Rank']

    temp_dict = {}

    for field in fields:
        element = soup.select_one('li:contains("' + field + '")')
        if element is None:
            temp_dict[field] = 'N/A'
        else:
            if field == 'Amazon Best Sellers Rank':
                # strip '#' and '(' from the rank text, then split into [number, category]
                item = [re.sub(r'#|\(', '', string).strip() for string in element.stripped_strings][1].split(' in ')
                temp_dict[field] = item
            else:
                item = [string for string in element.stripped_strings][1]
                temp_dict[field] = item.replace('(', '').strip()

    ranks = soup.select('.zg_hrsr_rank')
    ladders = soup.select('.zg_hrsr_ladder')

    if ranks:
        cat_nos = [item.text.split('#')[1] for item in ranks]
    else:
        cat_nos = ['N/A']

    if ladders:
        cats = [item.text.split('\xa0')[1] for item in ladders]
    else:
        cats = ['N/A']

    rankings = dict(zip(cat_nos, cats))

    map_dict = {
        'Product Dimensions': 'dimensions',
        'Shipping Weight': 'weight', 
        'Item model number': 'Item_No',
        'Amazon Best Sellers Rank': ['R1_NO','R1_CAT']
    }

    final_dict = {}

    for k,v in temp_dict.items():
        if k == 'Amazon Best Sellers Rank' and v!= 'N/A':
            item = dict(zip(map_dict[k],v))
            final_dict = {**final_dict, **item}
        elif k == 'Amazon Best Sellers Rank' and v == 'N/A':
            item = dict(zip(map_dict[k], [v, v]))
            final_dict = {**final_dict, **item}
        else:
            final_dict[map_dict[k]] = v

    for k,v in enumerate(rankings):
        #print(k + 1, v, rankings[v])
        prefix = 'R' + str(k + 2) + '_'
        final_dict[prefix + 'NO'] = v  
        final_dict[prefix + 'CAT'] = rankings[v]

    print(final_dict)
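
One thing to watch in the code above: rankings is a dict keyed by rank number, so two categories that share a number (both at #5, say) would collapse into one entry. A minimal sketch of an alternative follows, assuming the ranks are kept as an ordered list of (number, category) pairs and that every expected key should default to 'NA'. The helper name build_result and the constant BASE_KEYS are hypothetical and not part of the code above.

```python
from typing import Dict, List, Tuple

# Hypothetical names for illustration; they mirror the keys the question asks for.
BASE_KEYS = ("dimensions", "weight", "Item_No")


def build_result(details: Dict[str, str],
                 ranks: List[Tuple[str, str]],
                 max_ranks: int = 4,
                 missing: str = "NA") -> Dict[str, str]:
    """Return a flat dict in which every expected key is present."""
    # Start with every key set to the placeholder so missing fields survive.
    result = {key: missing for key in BASE_KEYS}
    for n in range(1, max_ranks + 1):
        result[f"R{n}_NO"] = missing
        result[f"R{n}_CAT"] = missing

    # Overwrite placeholders with whatever was actually scraped.
    for key in BASE_KEYS:
        if details.get(key):
            result[key] = details[key]

    # Ranks stay as ordered pairs, so two categories with the same
    # rank number do not overwrite each other.
    for n, (number, category) in enumerate(ranks[:max_ranks], start=1):
        result[f"R{n}_NO"] = number
        result[f"R{n}_CAT"] = category
    return result


# A page that only exposes one rank still yields the full set of keys.
page_two = build_result({}, [("2,036", "Health & Household")])
print(page_two["weight"], page_two["R1_NO"], page_two["R2_CAT"])
```

Setting the defaults first plays the same role as assigning 'NA' before each try/except in the question, but in one place.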
QHarr
  • Oops.... I should have scrolled further down the question as I didn't see that bit! I will have another look. – QHarr Jul 10 '19 at 03:45
  • Can you provide an example url where this is the case and the expected output? – QHarr Jul 10 '19 at 11:56
  • Please try bottom version with a few urls. – QHarr Jul 10 '19 at 13:12
  • Not a problem. Thank you for feeding back :-) – QHarr Jul 10 '19 at 15:19
  • can you provide an example url where this occurs? – QHarr Jul 13 '19 at 18:54
  • Hi, I am getting 'R1_NO' :'.zg_hrsr { margin: 0; padding: 0; list-style-type: none; }\n.zg_hrsr_item { margin: 0 0 0 10px; }\n.zg_hrsr_rank { display: inline-block; width: 80px; text-align: right; }'}' and no return for R1_CAT on URLS like this one: https://www.amazon.com/dp/B01N1ZD912 – workin 4weekend Jul 17 '19 at 12:33
  • I will look but if too much variation I may have to admit defeat. – QHarr Jul 17 '19 at 13:47
  • The HTML structure is different. For at least the first, the content is no longer entirely in li elements. It is now in th (for the _field_) and the next sibling td (for the content). You will need to write a test for this, and potentially use branched code, which is safer than extending the selectors. – QHarr Jul 18 '19 at 20:10
  • An easier question: how can I solve the "'R1_NO' :'.zg_hrsr { margin: 0; padding: 0; li..." result in the code? It relates to the li statements when there is no value for the stripped texts. I am looking for a way to handle the missing text. – workin 4weekend Jul 18 '19 at 21:22
  • ^^ oops... seen there is an example url. Will look tomorrow. – QHarr Jul 18 '19 at 21:54
  • Thank you for going to take time out to help me – workin 4weekend Jul 19 '19 at 20:21
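
One hedged way to avoid picking up the stylesheet text reported in the comments is to keep only the strings that look like a rank, rather than taking a fixed index from stripped_strings. The sketch below assumes each rank arrives as a single string such as "#5 in Category"; on pages where the number and the category sit in separate elements, they would need to be joined first. The names RANK_RE and extract_ranks are hypothetical, and the sample strings are made up to mimic the page text.

```python
import re
from typing import Iterable, List, Tuple

# Matches strings such as "#2,036 in Health & Household (See Top 100 ...)".
# Group 1 is the rank number, group 2 the category up to any "(".
RANK_RE = re.compile(r"#\s*([\d,]+)\s+in\s+([^(]+)")


def extract_ranks(strings: Iterable[str]) -> List[Tuple[str, str]]:
    """Keep only rank-like strings and return (number, category) pairs."""
    ranks = []
    for text in strings:
        # Replace non-breaking spaces so the pattern sees ordinary spaces.
        match = RANK_RE.match(text.replace("\xa0", " ").strip())
        if match:
            ranks.append((match.group(1), match.group(2).strip()))
    return ranks


# Assumed input: the kind of strings stripped_strings might yield for the rank item.
sample = [
    "Amazon Best Sellers Rank:",
    ".zg_hrsr { margin: 0; padding: 0; list-style-type: none; }",
    "#2,036 in Health & Household (See Top 100 in Health & Household)",
    "#5 in\xa0Joint & Muscle Pain Relief Medications",
]
print(extract_ranks(sample))
```

Because the pattern is anchored at the start of each string, the label and the CSS rules are skipped, and an empty list signals that no rank was found.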