I need to scrape a webpage (the link is in the code below). The page has a Cross Reference section that I want to scrape, but when I use Python requests to collect the content of the page with the following code:

import requests
from bs4 import BeautifulSoup

url = 'https://www.arrow.com/en/products/lmk107bbj475mklt/taiyo-yuden'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")

The resulting content does not have that Cross Reference part, maybe because it is not loaded yet. I can scrape the rest of the HTML content, but not the Cross Reference part. When I did the same thing with Selenium it worked fine, which means Selenium is able to find this element after it loads. Can someone guide me on how to get this done using Python requests and BeautifulSoup instead of Selenium?
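
A quick way to confirm that the element really is missing from the fetched HTML (a diagnostic sketch; the selector is taken from the Selenium answer below):

# Sketch: the selector is the one the Selenium answer below waits for.
# With plain requests the list is not in the served HTML, so this prints [].
print(soup.select('ul.WideSidebarProductList-list'))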

A.Hamza

2 Answers


The data is loaded through JavaScript, but you can extract it with requests, BeautifulSoup and the json module:

import json
import requests
from bs4 import BeautifulSoup

url = 'https://www.arrow.com/en/products/lmk107bbj475mklt/taiyo-yuden'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}

soup = BeautifulSoup(requests.get(url, headers=headers).text, 'lxml')

# The page state is embedded in a <script id="arrow-state"> tag, with special
# characters encoded as custom entities (&q; for ", &g; for >, and so on)
t = soup.select_one('#arrow-state').text
t = t.replace('&q;', '"').replace('&g;', '>').replace('&l;', '<').replace('&a;', '&')
data = json.loads(t)

# Find the PdpWrapper component, which holds the product-detail data
d = None
for item in data['jss']['sitecore']['route']['placeholders']['arrow-main']:
    if item['componentName'] == 'PdpWrapper':
        d = item
        break

if d:
    cross_reverence_product_tiles = d['placeholders']['product-details'][0]['fields']['crossReferenceProductTilesCollection']['crossReverenceProductTiles']['productTiles']
    print(json.dumps(cross_reverence_product_tiles, indent=4))

Prints:

[
    {
        "partId": "16571604",
        "partNumber": "CGB3B1X5R1A475M055AC",
        "productDetailUrl": "/en/products/cgb3b1x5r1a475m055ac/tdk",
        "productDetailShareUrl": "/en/products/cgb3b1x5r1a475m055ac/tdk",
        "productImage": "https://static5.arrow.com/pdfs/2017/4/18/7/26/14/813/tdk_/manual/010101_lowprofile_pi0402.jpg",
        "manufacturerName": "TDK",
        "productLineTitle": "Capacitor Ceramic Multilayer",
        "productDescription": "Cap Ceramic 4.7uF 10V X5R 20% Pad SMD 0603 85\u00b0C T/R",
        "datasheetUrl": "",
        "lowestPrice": 0.0645,
        "lowestPriceFormatted": "$0.0645",
        "highestPrice": 0.3133,
        "highestPriceFormatted": "$0.3133",
        "stockFormatted": "1,875",
        "stock": 1875,
        "attributes": [],
        "buyingOptionType": "AddToCart",
        "numberOfAttributesToShow": 1,
        "rrClickTrackingUrl": null,
        "pricingDataPopulated": true,
        "sourcePartId": "V72:2272_06586404",
        "sourceCode": "ACNA",
        "packagingType": "Cut Strip",
        "unitOfMeasure": "",
        "isDiscontinued": false,
        "productTileHint": null,
        "tileSize": 1,
        "tileType": "1x1",
        "suplementaryClasses": "u-height"
    },

...and so on.
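
A small usage sketch (assuming the structure shown above; not part of the original answer) that reduces each tile to its part number, price range and stock:

# Sketch: summarize each cross-reference tile using fields from the JSON above
for tile in cross_reverence_product_tiles:
    print(tile['partNumber'], tile['manufacturerName'],
          tile['lowestPriceFormatted'], '-', tile['highestPriceFormatted'],
          'stock:', tile['stockFormatted'])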
Andrej Kesely
  • Thanks man. I get it, but could you explain the replace part? Thanks for the reply – A.Hamza Aug 07 '19 at 06:42
  • @A.Hamza When you do `print( soup.select_one('#arrow-state').text )`, you will see that the text is encoded - before the `json` module can parse it, the custom entities (`&q;`, `&g;`, etc.) need to be replaced by their respective characters. – Andrej Kesely Aug 07 '19 at 06:45
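
A single-pass variant of that decode step (a sketch, not from the original answer; it uses the same entity table as the chained .replace() calls) avoids any chance of an earlier replacement producing text that a later one re-decodes:

import re

# One-pass decode of the custom entities (same mapping as in the answer)
ENTITIES = {'&q;': '"', '&g;': '>', '&l;': '<', '&a;': '&'}
pattern = '|'.join(re.escape(k) for k in ENTITIES)
t = re.sub(pattern, lambda m: ENTITIES[m.group(0)], t)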

Selenium alone will be enough to scrape the Cross References section by inducing WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following Locator Strategies:

  • Using CSS_SELECTOR:

      print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "ul.WideSidebarProductList-list h4")))])
    
  • Using XPATH:

      print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//ul[@class='WideSidebarProductList-list']//h4")))])
    
  • Note: You have to add the following imports:

      from selenium.webdriver.support.ui import WebDriverWait
      from selenium.webdriver.common.by import By
      from selenium.webdriver.support import expected_conditions as EC
    
  • Console Output:

      ['CGB3B1X5R1A475M055AC', 'CL10A475MP8NNNC', 'GRM185R61A475ME11D', 'C0603C475M8PACTU']
    
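For completeness, a minimal end-to-end sketch combining the pieces above (assuming Chrome with a matching chromedriver on PATH; the URL is the one from the question):

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # assumes chromedriver is available on PATH
try:
    driver.get('https://www.arrow.com/en/products/lmk107bbj475mklt/taiyo-yuden')
    # Wait up to 5 seconds for the cross-reference tiles to become visible
    tiles = WebDriverWait(driver, 5).until(
        EC.visibility_of_all_elements_located(
            (By.CSS_SELECTOR, "ul.WideSidebarProductList-list h4")))
    print([tile.get_attribute("innerHTML") for tile in tiles])
finally:
    driver.quit()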
undetected Selenium
  • As I mentioned in the question, I have already done it using Selenium. I wanted to do it using requests and BeautifulSoup. Thanks anyway. – A.Hamza Aug 07 '19 at 08:15