
I'm trying to extract specific classes from multiple URLs. The tags and classes stay the same, but I need my Python program to scrape them all as I input the links.

Here's a sample of my work:

from bs4 import BeautifulSoup
import requests
import pprint
import re
import pyperclip

url = input('insert URL here: ')
#scrape elements
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

#print titles only
h1 = soup.find("h1", class_="class-headline")
print(h1.get_text())

This works for individual URLs but not for a batch. Thanks for helping me. I learned a lot from this community.

Rudolph Musngi

2 Answers


Have a list of URLs and iterate through it:

from bs4 import BeautifulSoup
import requests
import pprint
import re
import pyperclip

urls = ['http://www.website1.com', 'http://www.website2.com', 'http://www.website3.com']  # add as many URLs as needed
#scrape elements
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    #print titles only
    h1 = soup.find("h1", class_="class-headline")
    print(h1.get_text())

If you are going to prompt the user for input for each site, it can be done this way:

from bs4 import BeautifulSoup
import requests
import pprint
import re
import pyperclip

#scrape elements
msg = 'Enter URL, or type q and hit enter to exit: '
url = input(msg)
while url != 'q':
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    #print titles only
    h1 = soup.find("h1", class_="class-headline")
    print(h1.get_text())
    url = input(msg)
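If the user instead pastes all the URLs at once, one per line, the input can be collected into a list first and then run through the same loop. This is only a sketch based on that idea; `parse_urls` and `collect_urls` are hypothetical helpers, not part of the answer's code, and a blank line is assumed to end the input:

```python
def parse_urls(raw):
    # one URL per non-empty line, surrounding whitespace dropped
    return [line.strip() for line in raw.splitlines() if line.strip()]

def collect_urls():
    # keep prompting until the user submits an empty line
    lines = []
    while True:
        line = input('URL (blank line to finish): ')
        if not line.strip():
            break
        lines.append(line)
    return parse_urls('\n'.join(lines))
```

Each URL in the returned list can then be fed through the same `requests.get` / `soup.find` loop as above.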
Falloutcoder
  • I get this error: Traceback (most recent call last): File "/Users/Computer/Desktop/test.py", line 7, in urls = input['https://website.com/link1','https://website.com/link2'] TypeError: 'builtin_function_or_method' object is not subscriptable – Rudolph Musngi Nov 16 '16 at 10:39
  • Are you going to take input of each URL from the user? If not, then simply put all the URLs in a list, as shown in my answer. Don't put a list in the input method. – Falloutcoder Nov 16 '16 at 10:53
  • I was thinking of input from user separated by lines? – Rudolph Musngi Nov 16 '16 at 12:25
  • You mean you prompt the user for input and the user types the URL and hits enter, it asks for the next URL and so on until the user says proceed, and only then starts processing? – Falloutcoder Nov 16 '16 at 12:28
  • No. More like paste all the URLs and then it process and outputs everything. – Rudolph Musngi Nov 16 '16 at 12:35
  • Plus when I try to write it to a text file using these lines: text_file = open("Titles.txt", "w") text_file.write(h1.get_text()) text_file.close() it only gives me one result? – Rudolph Musngi Nov 16 '16 at 12:41
  • Ok, if that's the case then the code in the above answer does the work. Put all the URLs in the list, like done in the answer for website1.com etc. Use the exact code; you should NOT use `input` now. As for the text file issue, that is because you are opening the file in write mode, which overwrites previous data. Use append mode `a` instead of `w`: `text_file = open("Titles.txt", "a")` – Falloutcoder Nov 16 '16 at 14:13
  • Thanks! It works. One more question though, if I wanted to say, ask the user to input the urls themselves, is there a workaround it? – Rudolph Musngi Nov 16 '16 at 18:44
  • Updated answer. – Falloutcoder Nov 16 '16 at 18:52
  • I add links separated by line like this: http://website.com/page1 http://website.com/page2 http://website.com/page3 and so on, but it only outputs one result. And when I try to export it to a txt file, I get an error: `Traceback (most recent call last): File "/Users/a/Desktop/test3.py", line 19, in text_file.write(h1.get_text()) UnicodeEncodeError: 'ascii' codec can't encode character '\u201c' in position 60: ordinal not in range(128)` – Rudolph Musngi Nov 17 '16 at 08:11
  • Use append mode to open the file. `text_file = open("Titles.txt", "a")` For unicode error check this http://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20?answertab=active#tab-top – Falloutcoder Nov 17 '16 at 08:21
  • the print function only displays one result. Nothing has changed. :) – Rudolph Musngi Nov 17 '16 at 11:35
  • I am not sure how you are doing it without having a look at your current code. – Falloutcoder Nov 17 '16 at 19:17
  • If you wanted to wait between requests in, let's say, intervals of 10 seconds, how would you do this? @dejavu_cmd_delt – ColeWorld Jan 15 '17 at 16:11

If you want to scrape links in batches, specify a batch size and iterate over it.

from bs4 import BeautifulSoup
import requests
import pprint
import re
import pyperclip

batch_size = 5
urllist = ["url1", "url2", "url3"]  # add as many URLs as needed
url_chunks = [urllist[x:x + batch_size] for x in range(0, len(urllist), batch_size)]

def scrape_url(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    h1 = soup.find("h1", class_="class-headline")
    return h1.get_text()

def scrape_batch(url_chunk):
    chunk_resp = []
    for url in url_chunk:
        chunk_resp.append(scrape_url(url))
    return chunk_resp

for url_chunk in url_chunks:
    print(scrape_batch(url_chunk))
Ahwan Kumar
  • If I wanted to space the requests to each url in intervals of 10 how could I do this? And I'm not familiar with the url chunks, what are their purpose? – ColeWorld Jan 15 '17 at 16:16
  • For spacing the requests, import time and use time.sleep(10) in the scrape_url function. url_chunks is a Python list that contains lists of URLs, e.g. [['www.website1.com', 'www.website2.com'], ['www.website3.com', 'www.website3.com']] – Ahwan Kumar Jan 19 '17 at 06:54
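The interval idea from the comment above can be combined with the batching like this. This is only a sketch: `make_chunks` and `scrape_spaced` are hypothetical names, and `fetch` is a stand-in parameter so the answer's `scrape_url` (or anything else) can be plugged in:

```python
import time

def make_chunks(urls, size):
    # split the URL list into batches of `size`
    return [urls[i:i + size] for i in range(0, len(urls), size)]

def scrape_spaced(urls, fetch, batch_size=5, delay=10):
    # fetch is any callable taking a URL; pass scrape_url here
    results = []
    for batch in make_chunks(urls, batch_size):
        for url in batch:
            results.append(fetch(url))
            time.sleep(delay)  # wait between requests
    return results
```

With `delay=10` each request is spaced ten seconds apart, which is the interval asked about in the comment.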