Hi, I'm practicing regular expressions in Python by parsing the titles of the top 250 movies from IMDb, but I'm having difficulty matching the content between two tags, like: The Godfather

import re, urllib.request
def movie(url):
    web_page = urllib.request.urlopen(url)
    lines = web_page.read().decode(errors = "replace")
    web_page.close()
    return re.findall('(?<=<a href=")/title.*?">.+?(?=</a>)', lines, re.DOTALL)
title = movie("https://www.imdb.com/search/title?groups=top_250&sort=user_rating")
for name in title:
    print(name)
allendom
  • https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – C.Nivs Feb 27 '19 at 21:18
  • To expand on why I linked that question: parse the HTML with a parser like `lxml` or `BeautifulSoup`, extract what you want, *then* use regex – C.Nivs Feb 27 '19 at 21:19

2 Answers

As pointed out in the comments, you had better give BeautifulSoup a try. Something like this will list the titles in Python 3:

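import requests
from bs4 import BeautifulSoup

html = requests.get('https://www.imdb.com/search/title?groups=top_250&sort=user_rating')
if html.ok:
    soup = BeautifulSoup(html.text, 'html.parser')
    html.close()

    # Each title is the text of the first link inside an <h3 class="lister-item-header">.
    # The loop stays inside the `if` so `soup` is never used when the request failed.
    for title in soup('h3', 'lister-item-header'):
        print(title('a')[0].get_text())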

And here is a cleaner version of the code above:

import requests
from bs4 import BeautifulSoup

imdb_entry_point = 'https://www.imdb.com/search/title'
imdb_payload = {
    'groups': 'top_250',
    'sort': 'user_rating'
}

with requests.get(imdb_entry_point, params=imdb_payload) as imdb:
    if imdb.ok:
        html = BeautifulSoup(imdb.text, 'html.parser')
        for i, h3 in enumerate(html('h3', 'lister-item-header'), 1):
            for a in h3('a'):
                print(i, a.get_text())

By the way, that entry point returns only 50 results, not the 250 you are expecting.
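
If installing BeautifulSoup is not an option, roughly the same idea (collect the link text inside each `h3` whose class is `lister-item-header`) can be sketched with the standard library's `html.parser`. This is a swapped-in technique, not the code above. The sample markup, the placeholder title IDs, and the names `TitleParser` and `extract_titles` are assumptions made for illustration; the real page may be structured differently.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of <a> tags nested in <h3 class="lister-item-header">."""

    def __init__(self):
        super().__init__()
        self.in_header = False
        self.in_link = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h3" and "lister-item-header" in classes:
            self.in_header = True
        elif tag == "a" and self.in_header:
            self.in_link = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False
        elif tag == "h3":
            self.in_header = False

    def handle_data(self, data):
        # Only text inside a link inside a matching header is kept.
        if self.in_link:
            self.titles[-1] += data


# Hypothetical markup that imitates the assumed page structure.
SAMPLE_HTML = """
<a href="/chart/top">Not a title link</a>
<h3 class="lister-item-header">
  <span class="lister-item-index">1.</span>
  <a href="/title/tt0000001/">The Shawshank Redemption</a>
</h3>
<h3 class="lister-item-header">
  <span class="lister-item-index">2.</span>
  <a href="/title/tt0000002/">The Godfather</a>
</h3>
"""


def extract_titles(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]


print(extract_titles(SAMPLE_HTML))
```

Running it on a fixed string like this also makes it easy to check the selection logic without hitting the network.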

accdias

Here is a working solution that uses both BeautifulSoup and a somewhat nasty regex, but it works fine. I love regexes, although I tend to write them in a weird way; I can explain how they work if you want.

import re, urllib.request
from bs4 import BeautifulSoup

url = "https://www.imdb.com/search/title?groups=top_250&sort=user_rating"
response = urllib.request.urlopen(url)
html = response.read()
soup = BeautifulSoup(html, 'html.parser')
# Number the headers from 1 and pull the link text out of each one.
for i, txt in enumerate(soup.find_all(attrs={"class": "lister-item-header"}), 1):
    # Drop newlines and quotes so the greedy pattern can scan the header as a single line.
    flat = re.sub('"', '', re.sub('\n', '', str(txt)))
    print(str(i) + ". " + re.match(r"^.*>(.*)</a>.*$", flat).group(1))

My output (the titles are in French):

  1. Les évadés

  2. Le parrain

  3. The Dark Knight: Le chevalier noir

  4. Le parrain, 2ème partie

  5. Le seigneur des anneaux: Le retour du roi

And the list goes on...
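
To make the regex step concrete, here is a small sketch on sample markup. The header strings, the placeholder title IDs, and both function names are hypothetical, so treat it as an illustration under those assumptions. The first function mirrors the greedy pattern above: the leading `.*` consumes as much as it can, so the capture group starts after the last `>` that still precedes `</a>`. The second shows an alternative that applies a capture group to the raw HTML, so `re.findall` returns only the link text.

```python
import re

# Hypothetical header markup with placeholder title IDs.
SAMPLE_HEADERS = [
    '<h3 class="lister-item-header">\n'
    '<span class="lister-item-index">1.</span>\n'
    '<a href="/title/tt0000001/">The Shawshank Redemption</a>\n'
    '<span class="lister-item-year">(1994)</span>\n'
    '</h3>',
    '<h3 class="lister-item-header">\n'
    '<span class="lister-item-index">2.</span>\n'
    '<a href="/title/tt0000002/">The Godfather</a>\n'
    '<span class="lister-item-year">(1972)</span>\n'
    '</h3>',
]


def title_via_greedy_match(header_html):
    # "." does not match newlines by default, so flatten the header first.
    flattened = header_html.replace("\n", "").replace('"', "")
    match = re.match(r"^.*>(.*)</a>.*$", flattened)
    return match.group(1) if match else None


def titles_via_capture_group(page_html):
    # With one capture group, findall returns only the captured link text.
    return re.findall(r'<a href="/title/[^"]*"[^>]*>([^<]+)</a>', page_html)


for header in SAMPLE_HEADERS:
    print(title_via_greedy_match(header))

print(titles_via_capture_group("\n".join(SAMPLE_HEADERS)))
```

The capture-group version needs no flattening because `[^"]*`, `[^>]*` and `[^<]+` never have to cross a tag boundary.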

Lyxilion