
I have a list with lots of links and I want to scrape them with BeautifulSoup in Python 3

links is my list and it contains hundreds of URLs. I have tried this code to scrape them all, but it's not working for some reason:

links= ['http://www.nuforc.org/webreports/ndxe201904.html',
'http://www.nuforc.org/webreports/ndxe201903.html',
'http://www.nuforc.org/webreports/ndxe201902.html',
'http://www.nuforc.org/webreports/ndxe201901.html',
'http://www.nuforc.org/webreports/ndxe201812.html',
'http://www.nuforc.org/webreports/ndxe201811.html',...]

raw = urlopen(i in links).read()
ufos_doc = BeautifulSoup(raw, "html.parser")

2 Answers


raw should be a list containing the data of each web page. For each entry in raw, parse it and create a soup object. You can store each soup object in a list (I called it soups):

from urllib.request import urlopen
from bs4 import BeautifulSoup

links= ['http://www.nuforc.org/webreports/ndxe201904.html',
'http://www.nuforc.org/webreports/ndxe201903.html',
'http://www.nuforc.org/webreports/ndxe201902.html',
'http://www.nuforc.org/webreports/ndxe201901.html',
'http://www.nuforc.org/webreports/ndxe201812.html',
'http://www.nuforc.org/webreports/ndxe201811.html']

# fetch the raw HTML of every page, then parse each one into its own soup object
raw = [urlopen(i).read() for i in links]
soups = []
for page in raw:
    soups.append(BeautifulSoup(page, 'html.parser'))

You can then access, e.g., the soup object for the first link with soups[0].
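
For example, a minimal sketch of pulling text out of the first page might look like this (the table structure is an assumption about the NUFORC markup, not something verified here):

# assumes the page contains at least one <table> of reports; adjust to the real markup
table = soups[0].find('table')
if table is not None:
    for row in table.find_all('tr'):
        cells = [cell.get_text(strip=True) for cell in row.find_all('td')]
        print(cells)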

Also, for fetching the response of each URL, consider using the requests module instead of urllib.

glhr

You need a loop over the list links. If you have a lot of these to do, as mentioned in the other answer, consider requests. With requests you can create a Session object, which will allow you to re-use the connection and thereby scrape more efficiently:

import requests
from bs4 import BeautifulSoup as bs

links= ['http://www.nuforc.org/webreports/ndxe201904.html',
'http://www.nuforc.org/webreports/ndxe201903.html',
'http://www.nuforc.org/webreports/ndxe201902.html',
'http://www.nuforc.org/webreports/ndxe201901.html',
'http://www.nuforc.org/webreports/ndxe201812.html',
'http://www.nuforc.org/webreports/ndxe201811.html']

# Session() must be instantiated; the with block closes the connection pool when done
with requests.Session() as s:
    for link in links:
        r = s.get(link)
        soup = bs(r.content, 'lxml')  # 'lxml' requires the lxml package; 'html.parser' also works
        #do something
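
As an illustration only, the #do something part could collect the text of every table row on each page into one list. This is a sketch that assumes each index page contains a <table> of report rows, which has not been checked against the actual markup:

all_rows = []
with requests.Session() as s:
    for link in links:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        # grab the text of every cell in every table row on the page (assumed markup)
        for row in soup.select('table tr'):
            all_rows.append([cell.get_text(strip=True) for cell in row.find_all('td')])
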
QHarr