How to parse the top 250 movie titles using regular expressions in Python

Question

Hi, I'm practicing regular expressions in Python by parsing the titles of the top 250 movies from IMDb, but I'm having difficulty extracting the content between two tags, like: The Godfather

import re, urllib.request
def movie(url):
    web_page = urllib.request.urlopen(url)
    lines = web_page.read().decode(errors = "replace")
    web_page.close()
    return re.findall('(?<=<a href=")/title.*?">.+?(?=</a>)', lines, re.DOTALL)
title = movie("https://www.imdb.com/search/title?groups=top_250&sort=user_rating")
for name in title:
    print(name)

https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — C.Nivs, Feb 27 '19 at 21:18
To expand on why I linked that question, parse the html with a parser like `lxml` or `beautifulSoup`, extract what you want, *then* use regex — C.Nivs, Feb 27 '19 at 21:19

accdias · Accepted Answer · 2019-03-01T15:04:17.930

As pointed out in the comments, you are better off trying BeautifulSoup. Something like this will list the titles in Python 3:

import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.imdb.com/search/title?groups=top_250&sort=user_rating')
if response.ok:
    soup = BeautifulSoup(response.text, 'html.parser')
    # Each title is the first link inside an <h3 class="lister-item-header">.
    for title in soup('h3', 'lister-item-header'):
        print(title('a')[0].get_text())
response.close()

And here is a cleaner version of the code above:

import requests
from bs4 import BeautifulSoup

imdb_entry_point = 'https://www.imdb.com/search/title'
imdb_payload = {
    'groups': 'top_250',
    'sort': 'user_rating'
}

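# The payload is sent as the URL query string.
with requests.get(imdb_entry_point, params=imdb_payload) as imdb:
    if imdb.ok:
        html = BeautifulSoup(imdb.text, 'html.parser')
        # Number the headers from 1 and print the text of every link inside each one.
        for i, h3 in enumerate(html('h3', 'lister-item-header'), 1):
            for a in h3('a'):
                print(i, a.get_text())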

BTW, that entry point returns just 50 results, not the 250 you are expecting.
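
To experiment with the same idea offline and without third-party packages, here is a minimal sketch that swaps BeautifulSoup for the standard library's `html.parser`. The sample markup, the `/title/tt0000001/` style links, and the names `TitleParser` and `extract_titles` are assumptions made up for illustration; only the `lister-item-header` class comes from the code above, and the live page may be structured differently.

```python
from html.parser import HTMLParser

# Hypothetical sample markup, modelled on the class name used above.
SAMPLE_HTML = """
<h3 class="lister-item-header">
  <span class="lister-item-index">1.</span>
  <a href="/title/tt0000001/">The Godfather</a>
  <span class="lister-item-year">(1972)</span>
</h3>
<h3 class="lister-item-header">
  <span class="lister-item-index">2.</span>
  <a href="/title/tt0000002/">The Dark Knight</a>
</h3>
<a href="/title/tt0000003/">Not inside a header</a>
"""


class TitleParser(HTMLParser):
    """Collect the text of <a> elements nested in <h3 class="lister-item-header">."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_header = False
        self._in_link = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h3" and "lister-item-header" in classes:
            self._in_header = True
        elif tag == "a" and self._in_header:
            # Start collecting the link text.
            self._in_link = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_link:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.titles.append("".join(self._buffer).strip())
            self._in_link = False
        elif tag == "h3":
            self._in_header = False


def extract_titles(html_text):
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.titles


for i, title in enumerate(extract_titles(SAMPLE_HTML), 1):
    print(i, title)
```

The link outside any header is skipped, which is the kind of filtering that is awkward to express with a single regular expression over the whole page.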

score 0 · Answer 2 · answered Feb 27 '19 at 22:30

Here is a working solution that uses both BeautifulSoup and some nasty regex, but it works fine. I love regex, though I seem to write them in a weird way; I can explain how they work if you want.

import re, urllib.request
from bs4 import BeautifulSoup

url = "https://www.imdb.com/search/title?groups=top_250&sort=user_rating"
response = urllib.request.urlopen(url)
html = response.read()
soup = BeautifulSoup(html, 'html.parser')
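# Number the entries starting from 1.
for i, txt in enumerate(soup.find_all(attrs={"class": "lister-item-header"}), 1):
    # Flatten the tag to one line without quotes, then capture the text just before </a>.
    flattened = re.sub('"', '', re.sub('\n', '', str(txt)))
    print(str(i) + " ." + re.match("^.*>(.*)</a>.*$", flattened).group(1))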

My output (it's in French):

Les évadés
Le parrain
The Dark Knight: Le chevalier noir
Le parrain, 2ème partie
Le seigneur des anneaux: Le retour du roi

And the list goes on...
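
As a rough illustration of how the regex in this answer behaves, the sketch below runs it on a single header. The `header` string is hypothetical sample markup, not copied from the live page. It also shows a narrower pattern with a capturing group, offered here as one possible alternative (not part of the original answer), which returns only the link text instead of the whole match.

```python
import re

# Hypothetical markup for one list entry (an assumption for illustration).
header = '''<h3 class="lister-item-header">
<span class="lister-item-index">1.</span>
<a href="/title/tt0000001/">The Godfather</a>
<span class="lister-item-year">(1972)</span>
</h3>'''

# Step 1: flatten to a single line and drop the double quotes.
flattened = header.replace("\n", "").replace('"', "")

# Step 2: the greedy ^.*> consumes everything up to the last '>' that is
# still followed by '</a>', so the group holds the text right before </a>.
match = re.match(r"^.*>(.*)</a>.*$", flattened)
title = match.group(1)
print(title)

# Alternative: capture only the text of links whose href starts with /title/.
link_text = re.compile(r'<a href="/title/[^"]*">([^<]+)</a>')
print(link_text.findall(header))
```

With a capturing group, `findall` returns just the captured text, whereas a lookbehind followed by the rest of the tag returns the `href` value together with the title.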
