
I am trying to write a Python script that lists all the links on a webpage that contain some substring. The problem I am running into is that the webpage has multiple "pages" so that the content doesn't clutter the whole screen. Take a look at https://www.go-hero.net/jam/17/solutions/1/1/C++ for an example.

This is what I have so far:

import requests
from bs4 import BeautifulSoup
url = "https://www.go-hero.net/jam/17/solutions/1/1/C++"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html5lib")
links = soup.find_all('a')

for tag in links:
  link = tag.get('href', None)
  if link is not None and 'GetSource' in link:
    print(link)

Any suggestions on how I might get this to work? Thanks in advance.

user2525951

1 Answer


Edit/Update: Using Selenium, you could click through the page links before scraping, so that all of the content accumulates in the page source. Many websites with pagination don't keep the text from every page in the HTML as you click through, but I noticed that the example you provided does. Take a look at this SO question for a quick example of making Selenium work with BeautifulSoup. Here is how you could use it in your code:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
original_url = "https://www.go-hero.net/jam/17/solutions/1/1/C++"
driver.get(original_url)

# click the links for pages 1-29
for i in range(1, 30):
    path_string = '/jam/17/solutions/1/1/C++#page-' + str(i)
    driver.find_element_by_xpath('//a[@href="' + path_string + '"]').click()

# scrape from the accumulated html
html = driver.page_source
driver.quit()
soup = BeautifulSoup(html, "html5lib")
links = soup.find_all('a')

# proceed as normal from here
for tag in links:
    link = tag.get('href', None)
    if link is not None and 'GetSource' in link:
        print(link)

Original Answer: For the link you provided above, you could simply loop through possible urls and run your scraping code in the loop:

import requests
from bs4 import BeautifulSoup
original_url = "https://www.go-hero.net/jam/17/solutions/1/1/C++"

# scrape from the original page (has no page number)
response = requests.get(original_url)
soup = BeautifulSoup(response.content, "html5lib")
links = soup.find_all('a')

# prepare to scrape from the pages numbered 1-29
# (note that the original page is not numbered, and the next page is "#page-1")
url_suffix = '#page-'

for i in range(1, 30):
    # add page number to the url
    paginated_url = original_url + url_suffix + str(i)
    response = requests.get(paginated_url)
    soup = BeautifulSoup(response.content, "html5lib")
    # append resulting list to 'links' list
    links += soup.find_all('a')

# proceed as normal from here
for tag in links:
    link = tag.get('href', None)
    if link is not None and 'GetSource' in link:
        print(link)

One caveat: as the code currently stands, you will get duplicate results in your links list, but you could add the links to a set instead to easily remedy that.
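A minimal sketch of that set-based deduplication, reusing the links list and the 'GetSource' substring from the code above:

# collect only unique matching hrefs instead of printing duplicates
unique_links = set()
for tag in links:
    link = tag.get('href', None)
    if link is not None and 'GetSource' in link:
        unique_links.add(link)

for link in sorted(unique_links):
    print(link)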

Brendan Goggin
  • The problem with this approach is that it doesn't actually get the links for page 2, 3, 4, etc. Instead we get the same links from page 0 over and over again. This can be validated by looking up some username from another page. – user2525951 Aug 09 '17 at 22:13
  • I added a better approach, and left the original approach at the bottom of the answer. The approach I added uses selenium to do what you want. I believe the original approach does work in loading page 2, 3, 4, etc., but you are right that it also loads page 0 a total of thirty times. – Brendan Goggin Aug 09 '17 at 22:23