
I'm unable to scrape the links to the articles on the paginated pages. In addition, I sometimes get a blank screen as output. I can't find the problem in my loop, and the CSV file never gets created.

from pprint import pprint
import requests
from bs4 import BeautifulSoup
import lxml
import csv
import urllib2

def get_url_for_search_key(search_key):
    for i in range(1,100):
        base_url = 'http://www.thedrum.com/'
        response = requests.get(base_url + 'search?page=%s&query=' + search_key +'&sorted=')%i
        soup = BeautifulSoup(response.content, "lxml")
        results = soup.findAll('a')
        return [url['href'] for url in soup.findAll('a')]
        pprint(get_url_for_search_key('artificial intelligence'))

with open('StoreUrl.csv', 'w+') as f:
    f.seek(0)
    f.write('\n'.join(get_url_for_search_key('artificial intelligence')))
Rrj17

1 Answer


Are you sure that you need only the first 100 pages? Maybe there are more of them...

My take on your task is below; this collects links from all pages and also reliably follows the "Next page" button links:

import requests
from bs4 import BeautifulSoup


base_url = 'http://www.thedrum.com/search?sort=date&query=artificial%20intelligence'
response = requests.get(base_url)
soup = BeautifulSoup(response.content, "lxml")

res = []

while True:
    # collect the href of every <a> tag on the current page
    res.append([url['href'] for url in soup.findAll('a')])

    # follow the "Next page" link until there isn't one
    next_button = soup.find('a', text='Next page')
    if not next_button:
        break
    response = requests.get(next_button['href'])
    soup = BeautifulSoup(response.content, "lxml")
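
Note that if the "Next page" href turns out to be relative rather than absolute (I haven't checked what the site actually returns), `requests.get` will not accept it. Below is a minimal, hypothetical sketch of how that step could be hardened with `urllib.parse.urljoin`, assuming Python 3; `get_next_page` is a helper name introduced here for illustration, not part of the answer above:

from urllib.parse import urljoin  # Python 3 standard library
import requests
from bs4 import BeautifulSoup

def get_next_page(soup, current_url):
    # hypothetical helper: returns (next_soup, next_url), or (None, None) when there is no next page
    next_button = soup.find('a', text='Next page')
    if not next_button:
        return None, None
    next_url = urljoin(current_url, next_button['href'])  # handles both relative and absolute hrefs
    response = requests.get(next_url)
    return BeautifulSoup(response.content, "lxml"), next_url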

EDIT: alternative approach for collecting only article links:

import requests
from bs4 import BeautifulSoup


base_url = 'http://www.thedrum.com/search?sort=date&query=artificial%20intelligence'
response = requests.get(base_url)
soup = BeautifulSoup(response.content, "lxml")

res = []

while True:
    # localize the search window to the block that contains the article links
    search_results = soup.find('div', class_='search-results')
    article_link_tags = search_results.findAll('a')  # the ordinary scheme goes further from here
    res.append([url['href'] for url in article_link_tags])

    next_button = soup.find('a', text='Next page')
    if not next_button:
        break
    response = requests.get(next_button['href'])
    soup = BeautifulSoup(response.content, "lxml")

To print the links, use:

for i in res:
    for j in i:
        print(j)
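
Since the original goal was to store the links in a CSV file (`StoreUrl.csv` in the question), here is a minimal sketch of how the collected `res`, which is a list of lists, could be flattened and written out; this assumes Python 3 and is only one possible way to do it:

import csv

with open('StoreUrl.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for page_links in res:           # res is a list of lists, one inner list per page
        for link in page_links:
            writer.writerow([link])  # one URL per row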
Dmitriy Fialkovskiy
  • I took the first 100 pages just for initial testing purposes. The problem is that when I try to print the links based on your solution, I get a series of "None" printed one below another. – Rrj17 Jul 26 '17 at 08:10
  • Just used `pprint(res.append([url['href'] for url in soup.findAll('a')]))` after the snippet that you had provided. I am not sure if it's the right way to proceed. Quite confused. – Rrj17 Jul 26 '17 at 08:20
  • Of course it's not correct =) At the end of the day you will have a _list of lists_. To print each link you'll have to loop over each `list` of links and over each link inside each `list` - double looping. – Dmitriy Fialkovskiy Jul 26 '17 at 08:38
  • Delete your print and inspect the `res` variable after the `while` loop ends. – Dmitriy Fialkovskiy Jul 26 '17 at 08:39
  • Added a valid print loop, please check. – Dmitriy Fialkovskiy Jul 26 '17 at 08:40
  • Thanks for that. The problem is I get a lot of unnecessary links and the output list goes on endlessly. How do I filter out only the URLs of the articles? I'm trying to store the URLs of these articles in a CSV file. – Rrj17 Jul 26 '17 at 15:11
  • Well, I'd say that you need a _rule_ that indicates that a particular link is a link to an article. Where to start? What you can do easily is check whether the article links are absolute, like `http://www.thedrum.com/article1` or `http://www.thedrum.com/random_stuff`. Build a loop that filters out relative links (see the sketch after these comments). – Dmitriy Fialkovskiy Jul 26 '17 at 15:15
  • Alternatively, if I had the task of collecting article URLs, I wouldn't go the way you started, by collecting ALL URLs (which obviously brings in a lot of trash). In most cases the page markup indicates the blocks where you can catch the needed links. But your general purpose wasn't outlined in the question initially =) – Dmitriy Fialkovskiy Jul 26 '17 at 15:17
  • Yeah...I realized that my question was too general. Sorry for that. – Rrj17 Jul 26 '17 at 15:28
  • Since you have an alternate approach, could you please guide me? – Rrj17 Jul 26 '17 at 15:47
  • Done! You're a savior! =) – Rrj17 Jul 27 '17 at 01:59
  • Any idea about incorporating Selenium in Python for scraping a webpage with a "load more" button? This is one of my previous questions, and I'll need help with the first issue (stated in the question): https://stackoverflow.com/questions/45186028/loading-more-content-in-a-webpage-and-issues-writing-to-a-file (that is also for article URL scraping). – Rrj17 Jul 28 '17 at 06:38
  • That was for the second issue (scraping info from the URLs in a CSV to a txt file). The first issue is still unsolved :( – Rrj17 Jul 28 '17 at 06:51
  • I'm not as good at `selenium` as at `BeautifulSoup` =) I think you'd better ask COLDSPEED, he can probably help you, or ask a separate question. – Dmitriy Fialkovskiy Jul 28 '17 at 06:55
  • I have a similar issue with another website: the first page's article URLs get scraped repeatedly. I would need your guidance. Here is the link: https://stackoverflow.com/questions/45477874/beautiful-soup-unable-to-scrape-beyond-first-page – Rrj17 Aug 03 '17 at 10:10
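
As a follow-up to the comment above about keeping only absolute article links, here is a minimal sketch of that rule applied to the collected `res`; the `http://www.thedrum.com/` prefix is an assumption based on the examples given in the comment, not something verified against the site:

# keep only links that look like absolute article URLs on thedrum.com;
# the prefix below is an assumption based on the examples in the comments
article_links = []
for page_links in res:
    for link in page_links:
        if link.startswith('http://www.thedrum.com/'):
            article_links.append(link)

for link in article_links:
    print(link)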