
I am trying to get the links to the individual search results on a website (the National Gallery of Art), but requesting the search URL does not load the search results. Here is what I tried:

import requests
from bs4 import BeautifulSoup

url = 'https://www.nga.gov/collection-search-result.html?artist=C%C3%A9zanne%2C%20Paul'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

I expected the links to the individual results to show up in soup.findAll('a'), but they do not. Instead, the last link returned points to an empty search result page: https://www.nga.gov/content/ngaweb/collection-search-result.html

How can I get a list of links in which the first item is the first search result (https://www.nga.gov/collection/art-object-page.52389.html), the second is the second search result (https://www.nga.gov/collection/art-object-page.52085.html), and so on?

Sanya Pushkar
  • Use [selenium](https://selenium-python.readthedocs.io/), as it simulates searches and other browser actions and helps with scraping. What may be happening here is that `requests` reads the HTML before the search has completed. – avats Oct 20 '21 at 20:40

2 Answers


The data actually comes from an API call that returns JSON. Requesting that endpoint directly gives the desired list of links.

Code:

import requests

# JSON endpoint that supplies the search results (page 1, page size 30)
url = 'https://www.nga.gov/collection-search-result/jcr:content/parmain/facetcomponent/parList/collectionsearchresu.pageSize__30.pageNumber__1.json?artist=C%C3%A9zanne%2C%20Paul&_=1634762134895'
r = requests.get(url)

# Each result holds a site-relative URL; prefix the host to make it absolute
for item in r.json()['results']:
    abs_url = f"https://www.nga.gov{item['url']}"
    print(abs_url)

Output:

https://www.nga.gov/content/ngaweb/collection/art-object-page.52389.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.52085.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.46577.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.46580.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.46578.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.136014.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46576.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.53120.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.54129.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.52165.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.46575.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.53122.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.93044.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.66405.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.53119.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.53121.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.46579.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.66406.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.45866.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.53123.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.45867.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.45986.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.45877.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.136025.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.74193.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.74192.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.66486.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.76288.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.76223.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.76268.html
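
The endpoint above returns only one page of 30 results. A minimal sketch of how further pages might be collected follows. It assumes that changing `pageNumber__1` in the path selects other pages, that the trailing `_=...` parameter is an optional cache-buster, and that an empty `results` list marks the end; none of this is confirmed by the site. The names `build_url`, `extract_links`, and `fetch_all_links` are hypothetical helpers, not part of any library.

```python
from urllib.parse import quote

BASE = "https://www.nga.gov"
# Path copied from the endpoint above; only the page number and artist vary.
ENDPOINT = (
    BASE
    + "/collection-search-result/jcr:content/parmain/facetcomponent/parList/"
    + "collectionsearchresu.pageSize__30.pageNumber__{page}.json?artist={artist}"
)


def build_url(artist, page):
    """Return the JSON endpoint URL for one page of results (assumed pattern)."""
    return ENDPOINT.format(page=page, artist=quote(artist))


def extract_links(payload):
    """Turn the site-relative 'url' of each result into an absolute link."""
    return [BASE + item["url"] for item in payload.get("results", [])]


def fetch_all_links(artist, max_pages=5):
    """Collect links page by page, stopping at the first empty page (assumption)."""
    import requests  # imported here so the helpers above work without it

    links = []
    for page in range(1, max_pages + 1):
        response = requests.get(build_url(artist, page), timeout=30)
        response.raise_for_status()
        page_links = extract_links(response.json())
        if not page_links:
            break
        links.extend(page_links)
    return links
```

Calling `fetch_all_links("Cézanne, Paul")` would then return one flat list of absolute links; `max_pages` caps the number of requests in case the stop condition assumed here does not hold.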
Md. Fazlul Hoque

This seems to work for me:


from bs4 import BeautifulSoup
import requests

url = 'https://www.nga.gov/collection-search-result.html?artist=C%C3%A9zanne%2C%20Paul'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# href=True skips <a> tags without an href, which would otherwise raise KeyError
for a in soup.find_all('a', href=True):
    print(a['href'])

It prints the href of every <a> element present in the static HTML.

The search result links themselves are not in that HTML. They are loaded afterwards via AJAX, so to get them this way you need something that executes the page's JavaScript, such as headless Chrome. One way to set this up, which fits your use case closely, is described here: http://theautomatic.net/2019/01/19/scraping-data-from-javascript-webpage-python/
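
As a rough illustration of the headless Chrome route, here is a sketch using Selenium 4, which is a swapped-in tool rather than the method from the linked article. It assumes Chrome and a matching driver are available, and that result links can be recognised by the substring `art-object-page` in their href, a guess based on the URLs in the question. `filter_result_links` and `collect_result_links` are hypothetical names.

```python
RESULT_MARKER = "art-object-page"  # assumed marker, taken from the question's URLs


def filter_result_links(hrefs):
    """Keep unique hrefs that look like result pages, preserving their order."""
    kept = []
    for href in hrefs:
        if href and RESULT_MARKER in href and href not in kept:
            kept.append(href)
    return kept


def collect_result_links(url, timeout=20):
    """Render the page in headless Chrome and return the result links."""
    # Imported here so the filter above can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")  # older Chrome versions use "--headless"
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        selector = f"a[href*='{RESULT_MARKER}']"
        # Wait until the AJAX call has inserted at least one result link.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, selector))
        )
        anchors = driver.find_elements(By.CSS_SELECTOR, selector)
        return filter_result_links(a.get_attribute("href") for a in anchors)
    finally:
        driver.quit()
```

Passing the search URL from the question to `collect_result_links` would return the links once the page has rendered. The explicit wait matters here: reading the page source immediately after `driver.get` can hit the same empty-results problem as plain `requests`.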

If you want to ask how to render JavaScript from Python and then parse the result, you would need to close this question and open a new one, as it is not scoped correctly as is.

james-see