Scrape URLs using BeautifulSoup in Python 3

Question

I tried this code, but the list of URLs stays empty. There is no error message, nothing.

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re

req = Request('https://www.metacritic.com/browse/movies/genre/date?page=0', headers={'User-Agent': 'Mozilla/5.0'})
html_page = urlopen(req).read()

soup = BeautifulSoup(html_page, features="xml")
links = []
for link in soup.findAll('a', attrs={'href': re.compile("^https://www.metacritic.com/movie/")}):
    links.append(link.get('href'))

print(links)

I want to scrape all the URLs that start with "https://www.metacritic.com/movie/" that are found in the given URL "https://www.metacritic.com/browse/movies/genre/date?page=0".

What am I doing wrong?

score 4 · Accepted Answer · answered Dec 24 '18 at 10:02

First, use Python's built-in "html.parser" instead of "xml" to parse the page content. It copes better with broken HTML (see "Beautiful Soup findAll doesn't find them all").

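Then take a look at the source code of the page you are parsing. The elements you want to find look like this: <a href="/movie/woman-at-war">. The href values are relative paths, so a pattern anchored on "https://www.metacritic.com/movie/" never matches them.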

So change your code like this:

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re

# Send a browser-like User-Agent, as in the question.
req = Request('https://www.metacritic.com/browse/movies/genre/date?page=0', headers={'User-Agent': 'Mozilla/5.0'})
html_page = urlopen(req).read()

# Parse the page as HTML rather than XML.
soup = BeautifulSoup(html_page, 'html.parser')

# The hrefs on the page are relative, so match on the path prefix only.
links = []
for link in soup.find_all('a', attrs={'href': re.compile(r"^/movie/")}):
    links.append(link.get('href'))

print(links)
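
If you need full URLs (as the question asks for) rather than the relative paths the page contains, one option is to join each href with the page URL using the standard library. This is only a sketch: the helper name `to_absolute` and the sample list are made up for illustration, and the sample path is the one quoted above.

```python
from urllib.parse import urljoin

# The page URL from the question, used as the base for relative links.
BASE_URL = "https://www.metacritic.com/browse/movies/genre/date?page=0"


def to_absolute(hrefs, base_url=BASE_URL):
    """Turn relative hrefs such as '/movie/woman-at-war' into absolute URLs."""
    return [urljoin(base_url, href) for href in hrefs]


# Hypothetical stand-in for the `links` list built by the scraper.
sample_links = ["/movie/woman-at-war"]
absolute_links = to_absolute(sample_links)
print(absolute_links)
```

Because the hrefs start with "/", urljoin keeps only the scheme and host of the base URL and replaces the rest of the path and the query string.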

answered Dec 24 '18 at 10:02

leiropi

Many thanks. One additional question, since you seem to be good at regex: how can I omit all URLs that are not of the form "/movie/movie-name", e.g. "/movie/movie-name/trailers"? I tried re.compile("^/movie/.+[^\/]"), but it keeps all the unwanted URLs. – TAN-C-F-OK Dec 24 '18 at 10:35

You could use a regex like "^/movie/([a-zA-Z0-9\-])+$" to match only links that contain nothing but letters, digits and hyphens after "/movie/" – leiropi Dec 24 '18 at 11:09

`if '/trailers' not in link.get('href'): links.append(link.get('href'))` – gosuto Dec 24 '18 at 12:10
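
The filtering ideas from the comments above can be compared on a few paths without touching the network. This is a rough sketch, not a tested scraper: the sample hrefs and the helper names are invented for illustration (only "/movie/woman-at-war" appears in the thread), and the real page may contain other link shapes.

```python
import re

# Hypothetical hrefs of the kinds discussed in the comments.
sample_hrefs = [
    "/movie/woman-at-war",
    "/movie/woman-at-war/trailers",
    "/movie/woman-at-war/critic-reviews",
    "/movies",
]

# Pattern tried in the first comment: ".+" also matches "/", and there is
# no "$" anchor, so sub-pages still match.
ASKER_PATTERN = re.compile(r"^/movie/.+[^\/]")

# Pattern suggested in the reply: only letters, digits and hyphens are
# allowed after "/movie/", and "$" forbids anything further.
MOVIE_PAGE = re.compile(r"^/movie/[a-zA-Z0-9\-]+$")


def only_movie_pages(hrefs):
    """Keep hrefs of the form /movie/<name> with no further path segment."""
    return [h for h in hrefs if MOVIE_PAGE.search(h)]


def without_trailers(hrefs):
    """Substring approach: keep movie links that do not mention trailers."""
    return [h for h in hrefs if h.startswith("/movie/") and "/trailers" not in h]


print(only_movie_pages(sample_hrefs))
print(without_trailers(sample_hrefs))
```

On these samples the anchored regex keeps only "/movie/woman-at-war", while the substring check drops the trailers link but still lets other sub-pages such as "/critic-reviews" through.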

score 2 · Answer 2 · answered Dec 24 '18 at 10:01

Your code is sound.

The list stays empty because there aren't any URLs on that page matching that pattern. Try re.compile("^/movie/") instead.
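
To see the mismatch in isolation, both patterns can be checked against a single href without fetching anything. A minimal sketch, assuming the links look like "/movie/woman-at-war" (the form quoted in the accepted answer); the variable names are only illustrative.

```python
import re

# Example href in the relative form the page is reported to use.
href = "/movie/woman-at-war"

# Pattern from the question (dots escaped here so they match literally).
absolute_pattern = re.compile(r"^https://www\.metacritic\.com/movie/")
# Suggested replacement that matches the relative path.
relative_pattern = re.compile(r"^/movie/")

absolute_matches = bool(absolute_pattern.search(href))
relative_matches = bool(relative_pattern.search(href))

print(absolute_matches, relative_matches)
```

The first pattern can never match because the href does not contain the scheme or host at all.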

answered Dec 24 '18 at 10:01

gosuto

@leiropi is right, `features="xml"` is also giving you problems. `soup = BeautifulSoup(html_page, 'lxml')` does give the right results though. – gosuto Dec 24 '18 at 10:34
