
I am trying to scrape www.gall.nl to build a database of all the wines sold on that platform. I have the following code:

import requests
from bs4 import BeautifulSoup
URL = 'https://www.gall.nl/wijn/'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

mydivs = soup.find_all("div", {"class": "c-product-tile"})    
print(len(mydivs))
first_wijn = mydivs[0]
print(first_wijn)
result = first_wijn.find()

So, this provides 12 results, which is correct.

Printing the first result provides the following:

<div class="c-product-tile" data-product='{"name":"Faustino V Rioja Reserva","id":"143561","currencyCode":"EUR","price":13.99,"discount":0,"brand":"Faustino","category":"Wijn","variant":"75CL","list":"productoverzicht","position":1,"dimension13":"2","dimension37":"Ja"}' itemprop="item" itemscope="" itemtype="https://schema.org/Product" js-hook-product-tile="">
<meta content="143561" itemprop="sku">
<meta content="8410441412065" itemprop="gtin8">
<meta content="Faustino" itemprop="brand">
<div class="product-tile__header">
<div class="product-tile__category-label">
<div class="m-product-taste-tooltip">
<span aria-label="Classic Red" class="a-tooltip-trigger" data-content="Stevig &amp; Ferm" data-placement="bottom-start" js-hook-tooltip="">
<div class="tooltip-trigger__icon product-taste-tooltip__icon u-taste-profile-icon classic-red-red 
....
<input class="add-to-cart-url" type="hidden" value="/on/demandware.store/Sites-gall-nl-Site/nl_NL/Cart-AddProduct"/>
</div>
</meta></meta></meta></div>

And I'm interested in getting the data from the first line:

<div class="c-product-tile" data-product='{"name":"Faustino V Rioja Reserva","id":"143561","currencyCode":"EUR","price":13.99,"discount":0,"brand":"Faustino","category":"Wijn","variant":"75CL","list":"productoverzicht","position":1,"dimension13":"2","dimension37":"Ja"}' itemprop="item" itemscope="" itemtype="https://schema.org/Product" js-hook-product-tile="">

From it, I want to extract the name, price and brand.

Can somebody help me with retrieving these data?

0stone0
Tobias

1 Answer


Use BeautifulSoup's .attrs.get to read the data-product attribute from each div.
That attribute holds a JSON string, so parse it with json.loads to read the desired values.

import json
import requests
from bs4 import BeautifulSoup

URL = 'https://www.gall.nl/wijn/'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

# Get all products
mydivs = soup.find_all("div", {"class": "c-product-tile"})

# Loop through each product
for div in mydivs:

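    # Get the data-product attribute (None if the tile lacks it)
    product = div.attrs.get("data-product", None)
    if product is None:
        continue

    # Parse the JSON string into a dict. The encode/decode round trip
    # drops non-ASCII characters such as accented letters; call
    # json.loads(product) directly to keep them.
    jsonProduct = json.loads(product.encode('utf-8').decode('ascii', 'ignore'))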

    # Show name - brand - price
    print('{0:<40} {1:<20} {2:>10}'.format(
        jsonProduct['name'],
        jsonProduct['brand'],
        jsonProduct['price']
    ))

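Using str.format() to lay the values out in three columns, the code above produces the following output. Some names appear with letters missing (for example "Domaine Lamourie Ros") because the ASCII round trip in the json.loads line drops accented characters: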

Faustino V Rioja Reserva                 Faustino                  13.99
Mucho Ms Tinto                           Mucho Mas                  5.99
Cantina di Verona Valpolicella Ripasso   Terre Di Verona           11.99
Villa Jeantel                            Villa Jeantel              8.99
Ondarre Rioja Reserva                    Ondarre                   13.59
Valdivieso Chardonnay                    Valdivieso                 5.99
Domaine Lamourie Ros                     Domaine Lamourie           7.99
Oveja Negra Chardonnay Viognier          Oveja Negra                6.59
La Palma Merlot                          La Palma                   6.59
Alamos Chardonnay                        Alamos                     8.99
Les Hautes Pentes ros                    Les Hautes Pentes          7.99
Piccini Memoro Rosso                     Piccini                    7.29
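
If the goal is to feed a database rather than print a table, the same idea can be wrapped in a function that returns one dict per product. The following is only a minimal sketch under some assumptions: extract_products and SAMPLE_HTML are hypothetical names, the sample markup is a shortened stand-in for the real page (the "Exemple Rosé" entry is made up to show an accented name), and the JSON is parsed without the ASCII round trip so accented characters are kept.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical sample markup, shortened from the tile shown in the question.
# The second tile deliberately has no data-product attribute, and the third
# entry is invented to demonstrate an accented name.
SAMPLE_HTML = """
<div class="c-product-tile" data-product='{"name":"Faustino V Rioja Reserva","id":"143561","currencyCode":"EUR","price":13.99,"brand":"Faustino"}'></div>
<div class="c-product-tile"></div>
<div class="c-product-tile" data-product='{"name":"Exemple Rosé","id":"000000","currencyCode":"EUR","price":7.99,"brand":"Exemple"}'></div>
"""


def extract_products(html):
    """Return a list of {name, brand, price} dicts, one per product tile."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for tile in soup.find_all("div", {"class": "c-product-tile"}):
        raw = tile.get("data-product")  # None when the attribute is absent
        if raw is None:
            continue
        data = json.loads(raw)  # no ASCII round trip, so accents survive
        products.append({key: data.get(key) for key in ("name", "brand", "price")})
    return products


products = extract_products(SAMPLE_HTML)
for item in products:
    print("{name:<40} {brand:<20} {price:>10}".format(**item))
```

For the live site you would pass page.content from requests.get(URL) instead of SAMPLE_HTML. The resulting list of dicts can then be handed to whatever storage you choose.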
0stone0