
I'm trying to access a webpage and return all of the hyperlinks on that page. I'm using the same code from a question that was answered here.

I want to access this correct page, but it only returns content from this incorrect page.

Here is the code I am running:

import httplib2
from bs4 import SoupStrainer, BeautifulSoup

http = httplib2.Http()
status, response = http.request('https://www.iparkit.com/Minneapolis')

# parse_only is the bs4 name for the older parseOnlyThese argument
for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])

Results:

/account
/monthlyAccount
/myproducts
/search
/
{{Market}}/search?expressSearch=true&timezone={{timezone}}
{{Market}}/search?expressSearch=false&timezone={{timezone}}
{{Market}}/events
monthly?timezone={{timezone}}
/login?next={{ getNextLocation(browserLocation) }}
/account
/monthlyAccount
/myproducts
find
parking-app
iparkit-express
https://interpark.custhelp.com
contact
/
/parking-app
find
https://interpark.custhelp.com
/monthly
/iparkit-express
/partners
/privacy
/terms
/contact
/events

I don't mind returning the above results, but it doesn't return any links that could get me to the page I want. Maybe it's protected? Any ideas or suggestions are welcome. Thank you in advance.

mm_nieder
  • Your code is broken. You didn't include `import BeautifulSoup`. And `if link.has_attr('href'):` has `import SoupStrainer` under it instead of what I assume would be a print statement. I also don't see the need to include the Chicago loop if it's not the issue. I get the same result as you when I fix it up and run it, for what it's worth. – Zhenhir May 07 '18 at 17:02
  • Hi @Zhenhir, thank you for your response. It's there, I just didn't include it here; I'll add it. I also double-pasted the code and have made the necessary edits. Same results... – mm_nieder May 07 '18 at 17:38

1 Answer

The page you are trying to scrape is generated almost entirely by JavaScript.

A plain http.request('https://www.iparkit.com/Minneapolis') returns only the initial HTML, before any scripts have run, so it contains almost none of the content you see in a browser. That is also why some of your results are unrendered template placeholders such as {{Market}}/events.

Instead, you must do what a real browser does: run the JavaScript, then scrape the result. For this you can try Selenium.

For your page, you get ~84 URLs after running the JavaScript, compared with ~7 URLs without it.

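from selenium import webdriver
from selenium.webdriver.chrome.service import Service

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")  # run Chrome without a visible window
# Selenium 4 style: the driver path goes through Service, the options through options=
driver = webdriver.Chrome(service=Service('PATH_TO_CHROME_WEBDRIVER'), options=chrome_options)
driver.get('https://www.iparkit.com/Minneapolis')

content = driver.page_source  # HTML of the page as rendered by the browser
driver.quit()  # close the browser once the HTML is captured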

Then extract what you want from content, using BeautifulSoup in your case.
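
As a rough sketch of that last step, the function below collects the href values from an HTML string and separates out any that still contain a "{{" placeholder, like the ones in the question's output. The name extract_links and the sample string are hypothetical, used only for illustration; in practice you would pass in content from the Selenium snippet above.

```python
from bs4 import BeautifulSoup, SoupStrainer


def extract_links(html):
    """Return (rendered, unrendered) lists of href values from <a> tags."""
    rendered, unrendered = [], []
    # parse_only restricts parsing to <a> tags, as in the question's code
    soup = BeautifulSoup(html, "html.parser", parse_only=SoupStrainer("a"))
    for link in soup.find_all("a", href=True):
        href = link["href"]
        # "{{" suggests a client-side template placeholder that was never filled in
        if "{{" in href:
            unrendered.append(href)
        else:
            rendered.append(href)
    return rendered, unrendered


# Hypothetical sample standing in for driver.page_source
sample = (
    '<a href="/account">Account</a>'
    '<a href="{{Market}}/events">Events</a>'
    '<a name="top">no href</a>'
)
rendered, unrendered = extract_links(sample)
print(rendered)
print(unrendered)
```

If the unrendered list is not empty for the real page, the HTML was probably captured before the scripts finished, or without running them at all.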

Temperosa