
A simple question: I can scrape results from the first page of a DuckDuckGo search, but I am struggling to get onto the second and subsequent pages. I am using Python with the Selenium WebDriver, which works fine for the first-page results. The code I have used to scrape the first page is:

results_url = "https://duckduckgo.com/?q=paralegal&t=h_&ia=web" 
browser.get(results_url)
results = browser.find_elements_by_id('links') 
num_page_items = len(results) 
for i in range(num_page_items): 
    print(results[i].text) 
    print(len(results)) 

nxt_page = browser.find_element_by_link_text("Load More")
if nxt_page:
    nxt_page.send_keys(Keys.PAGE_DOWN)

There are line breaks indicating the start of a new page, but they do not appear to alter the URL, so I tried the above to move down the page and then repeat the code for finding the links on the next page. However, it does not work. Any help would be very much appreciated.

user8784011

2 Answers


If I search for Load More in the source code of the results page, I can't find it. Did you try using the non-JavaScript version?

You can use it by simply adding html to the URL: https://duckduckgo.com/html?q=paralegal&t=h_&ia=web. There you can find the Next button at the end.

This one works for me (Chrome version):

results_url = "https://duckduckgo.com/html?q=paralegal&t=h_&ia=web"
browser.get(results_url)
results = browser.find_elements_by_id('links')
num_page_items = len(results)
for i in range(num_page_items):
    print(results[i].text)
    print(len(results))
# Select the "Next" button by its value attribute, then scroll it into
# view before clicking so the click is not intercepted.
nxt_page = browser.find_element_by_xpath('//input[@value="Next"]')
if nxt_page:
    browser.execute_script('arguments[0].scrollIntoView();', nxt_page)
    nxt_page.click()

Btw.: DuckDuckGo also provides a nice API, which is probably much easier to use ;)
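For completeness, the API mentioned here is DuckDuckGo's Instant Answer API, which can be queried without a browser at all. A minimal sketch (assuming the public https://api.duckduckgo.com/ endpoint with format=json; note it returns instant-answer data such as abstracts, not a full list of web results):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def build_api_url(query):
    # Build the Instant Answer API URL; format=json requests a JSON payload
    # and no_html=1 strips HTML markup from the text fields.
    params = urlencode({"q": query, "format": "json", "no_html": 1})
    return "https://api.duckduckgo.com/?" + params

# Uncomment to actually fetch (requires network access):
# data = json.load(urlopen(build_api_url("paralegal")))
# print(data.get("AbstractText"))

print(build_api_url("paralegal"))
```

This sidesteps Selenium entirely, but it is only a fit if instant-answer data is enough for your use case.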

edit: fixed the selector for the next-page link, which selected the Previous button on the second result page (thanks to @kingbode)

Jan Zeiseweis
    Thank you for that. I was using the HTML version in Selenium IDE and everything worked there, but when I transferred the code to Visual Studio, the error 'could not locate element' kept appearing. However, I have now solved that problem with a CSS selector. My code is nxt_page = browser.find_element_by_css_selector("input.btn") followed by nxt_page.click() – this works to find the button. Thank you for your help. – user8784011 Oct 16 '17 at 16:18

Selecting by the class 'btn--alt' will not work once you are on the second page, because both the 'Next' and 'Previous' buttons share that class name; it was clicking the Previous button and sending me back again!

The code change below worked for me perfectly:

nextButton = driver.find_element_by_xpath('//input[@value="Next"]')
nextButton.click()
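The value-based XPath works because it keys on the button's value attribute rather than its shared class. A minimal illustration with BeautifulSoup (assuming bs4 is installed, and using a stripped-down, hypothetical copy of the markup at the bottom of a duckduckgo.com/html results page):

```python
from bs4 import BeautifulSoup

# Simplified markup mimicking the results-page navigation: both buttons
# share the class "btn--alt", so only the value attribute tells them apart.
html = """
<form action="/html/" method="post">
  <input type="submit" class="btn btn--alt" value="Previous">
</form>
<form action="/html/" method="post">
  <input type="submit" class="btn btn--alt" value="Next">
</form>
"""

soup = BeautifulSoup(html, "html.parser")
buttons = soup.find_all("input", {"class": "btn--alt"})
print(len(buttons))  # class alone matches both buttons
next_button = soup.find("input", {"value": "Next"})
print(next_button["value"])  # the value attribute is unambiguous
```

The same logic applies to Selenium's locators: a class selector returns whichever matching element comes first in the DOM, which on page two is the Previous button.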

full function:

# Assumes: from selenium import webdriver
#          from selenium.webdriver.common.keys import Keys
#          from bs4 import BeautifulSoup
#          import time
def duckduckGoSearch(query, searchPages=None, filterTheSearch=False, searchFilter=None):

    URL_ = 'https://duckduckgo.com/html?'
    driver = webdriver.Chrome()
    driver.get(URL_)

    searchResults = {}

    # Enter the search query
    driver.find_element_by_xpath('//*[@id="search_form_input_homepage"]').send_keys(query)

    # Press Enter to perform the search
    driver.find_element_by_xpath('//*[@id="search_form_input_homepage"]').send_keys(Keys.RETURN)

    time.sleep(2)

    page_number = 1

    while True:

        # loop for the required number of pages
        if searchPages is None or page_number <= searchPages:

            try:
                nextButton = driver.find_element_by_xpath('//input[@value="Next"]')
                nextButton.click()
                page_number += 1

                try:
                    webPageSource = driver.page_source

                    # parse and get the urls for the results
                    soup = BeautifulSoup(webPageSource, "html.parser")
                    Data_Set_div_Tags = soup.findAll('h2') + soup.findAll('div', {'class': 'result__body links_main links_deep'})

                    for i in range(0, len(Data_Set_div_Tags)):

                        try:
                            resultDescription = Data_Set_div_Tags[i].findAll('a')[0].text
                            resultURL = Data_Set_div_Tags[i].findAll('a')[0]['href']
                        except (IndexError, KeyError):
                            print('nothing to parse')
                            continue

                        if resultURL not in searchResults.keys():
                            if filterTheSearch:
                                if searchFilter in resultURL:
                                    searchResults[resultURL] = resultDescription
                            else:
                                searchResults[resultURL] = resultDescription

                except Exception:
                    print('search is done , found ', len(searchResults), 'Results')
                    break

            except Exception:  # no "Next" button left, so stop paging
                print('search is done , found ', len(searchResults), 'Results')
                print('no more pages')
                driver.quit()
                break

        else:
            print('search is done , found ', len(searchResults), 'Results')
            driver.quit()
            break

    return searchResults
kingbode