I am trying to extract all the links from a forum (https://www.pakwheels.com/forums/c/travel-n-tours). My scraper stops after scrolling down once.

from bs4 import BeautifulSoup

sourceUrl = 'https://www.pakwheels.com/forums/c/travel-n-tours'

# Source of the scrolling code:
# http://stackoverflow.com/questions/32391303/how-to-scroll-to-the-end-of-the-page-using-selenium-in-python
# --------------------- Scrolling to the bottom of the page ---------------------

from selenium import webdriver
import time

chrome_path = r"C:\Users\Shani\Desktop\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get(sourceUrl)
updatedLenOfPage = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
scrollComplete = False
while not scrollComplete:
    currentLenOfPage = updatedLenOfPage
    updatedLenOfPage = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
    print('Scrolling down')
    time.sleep(5)
    if currentLenOfPage == updatedLenOfPage:
        scrollComplete = True
time.sleep(10)
pageSource = driver.page_source

# --------------------- Getting links ---------------------
soup = BeautifulSoup(pageSource, 'lxml')
# print(soup)

blogUrls=[]
for url in soup.find_all('a'):
    if((url.get('href').find('/forums/t/')!=-1) and (url.get('href').find('about-the-travel-n-tours-category')==-1) and (url.get('href').find('/forums/t/topic/')==-1)):
        blogUrls.append(url.get('href'))
        print(url.get('href'))       
print(len(blogUrls))

It gives the following error:

Traceback (most recent call last):
  File "D:\LiclipsWorkSpace\NLKTLib\Scraping\scrolling.py", line 32, in <module>
    if((url.get('href').find('/forums/t/')!=-1) and (url.get('href').find('about-the-travel-n-tours-category')==-1) and (url.get('href').find('/forums/t/topic/')==-1)):
AttributeError: 'NoneType' object has no attribute 'find'

Please help.

  • Yes, you could say that, but I couldn't understand the answer in that question. This is what I am trying to do, and I am getting errors. Any suggestions? – Zeeshan Ul Haq Apr 08 '17 at 07:52

1 Answer


You don't need Selenium; you can get all the links from the forum's JSON response. This code gets the URLs from the first 5 pages (to fetch every page, simply change the 5 to 264).

import requests

# Each listing page is also served as JSON; page numbers start at 0
for i in range(0, 5):
    r = requests.get(
        'https://www.pakwheels.com/forums/c/travel-n-tours/l/latest.json?page={}'.format(i)).json()
    topics = r['topic_list']['topics']
    for topic in topics:
        # Rebuild each topic URL from its slug and numeric id
        print('https://www.pakwheels.com/forums/t/{}/{}'.format(topic['slug'], topic['id']))
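
If you would rather not hardcode the page count, you can keep requesting pages until the endpoint stops returning topics. A minimal sketch along those lines (the stopping condition is an assumption: that the API returns an empty topics list once you are past the last page):

import requests

base = 'https://www.pakwheels.com/forums/c/travel-n-tours/l/latest.json?page={}'
urls = []
page = 0
while True:
    r = requests.get(base.format(page)).json()
    topics = r['topic_list']['topics']
    if not topics:  # assumption: an empty list means we are past the last page
        break
    for topic in topics:
        urls.append('https://www.pakwheels.com/forums/t/{}/{}'.format(topic['slug'], topic['id']))
    page += 1

print(len(urls))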
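
And if you do want to keep the Selenium approach, the AttributeError in your traceback comes from <a> tags that have no href attribute: url.get('href') returns None for those, and None has no .find(). A minimal sketch of the link-filtering loop with a guard (same filter conditions as yours):

blogUrls = []
for url in soup.find_all('a'):
    href = url.get('href')  # None when the <a> tag has no href attribute
    if href is None:
        continue
    if ('/forums/t/' in href
            and 'about-the-travel-n-tours-category' not in href
            and '/forums/t/topic/' not in href):
        blogUrls.append(href)
print(len(blogUrls))

Alternatively, soup.find_all('a', href=True) only matches anchors that actually have an href, so the guard becomes unnecessary.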
Vlad