
I am doing some web scraping on the Rotten Tomatoes website, for example this page: https://www.rottentomatoes.com/m/leto.

I am using Python with the Beautiful Soup and lxml modules together.

I want to extract the movie info, for example:

  • Genre: Drama, Musical & Performing Arts

  • Directed By: Kirill Serebrennikov

  • Written By: Mikhail Idov, Lili Idova, Ivan Kapitonov, Kirill Serebrennikov, Natalya Naumenko

  • Written by (links): /celebrity/michael_idov, /celebrity/lily_idova, /celebrity/ivan_kapitonov, /celebrity/kirill_serebrennikov, /celebrity/natalya_naumenko

I inspected the page HTML to work out the paths:

                    <li class="meta-row clearfix">
                        <div class="meta-label subtle">Rating: </div>
                        <div class="meta-value">NR</div>
                    </li>


                    <li class="meta-row clearfix">
                        <div class="meta-label subtle">Genre: </div>
                        <div class="meta-value">

                                <a href="/browse/opening/?genres=9">Drama</a>, 

                                <a href="/browse/opening/?genres=12">Musical &amp; Performing Arts</a>

                        </div>
                    </li>


                    <li class="meta-row clearfix">
                        <div class="meta-label subtle">Directed By: </div>
                        <div class="meta-value">

                                <a href="/celebrity/kirill_serebrennikov">Kirill Serebrennikov</a>

                        </div>
                    </li>


                    <li class="meta-row clearfix">
                        <div class="meta-label subtle">Written By: </div>
                        <div class="meta-value">

                                <a href="/celebrity/michael_idov">Mikhail Idov</a>, 

                                <a href="/celebrity/lily_idova">Lili Idova</a>, 

                                <a href="/celebrity/ivan_kapitonov">Ivan Kapitonov</a>, 

                                <a href="/celebrity/kirill_serebrennikov">Kirill Serebrennikov</a>, 

                                <a href="/celebrity/natalya_naumenko">Natalya Naumenko</a>

                        </div>
                    </li>


                    <li class="meta-row clearfix">
                        <div class="meta-label subtle">In Theaters: </div>
                        <div class="meta-value">
                            <time datetime="2019-06-06T17:00:00-07:00">Jun 7, 2019</time>
                            <span style="text-transform:capitalize">&nbsp;limited</span>
                        </div>
                    </li>




                    <li class="meta-row clearfix">
                        <div class="meta-label subtle">Runtime: </div>
                        <div class="meta-value">
                            <time datetime="P126M">
                                126 minutes
                            </time>
                        </div>
                    </li>


                    <li class="meta-row clearfix">
                    <div class="meta-label subtle">Studio: </div>
                    <div class="meta-value">

                            <a href="http://sonypictures.ru/leto/" target="movie-studio">Gunpowder &amp; Sky</a>

                    </div>

            </li>

I created the html objects like this:

    import requests
    from bs4 import BeautifulSoup
    from lxml import html

    page_response = requests.get(url, timeout=5)
    page_content = BeautifulSoup(page_response.content, "html.parser")
    tree = html.fromstring(page_response.content)

For the Writer, for example, as I only need the text of the element, it is fairly easy to get:

    page_content.select('div.meta-value')[3].getText()
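
One thing to note: getText() on that div keeps the newlines and indentation around the names, so a small cleanup step may be useful. A sketch reusing the page_content object defined above:

    writers = " ".join(page_content.select('div.meta-value')[3].getText().split())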

Alternatively, using xpath, for the Rating:

    tree.xpath('//div[@class="meta-value"]/text()')[0]

For the desired Writer links, which is where I have the issue, I access the HTML chunk like this:

    page_content.select('div.meta-value')[3]

Which gives:

<div class="meta-value">
<a href="/celebrity/michael_idov">Mikhail Idov</a>, 

                                <a href="/celebrity/lily_idova">Lili Idova</a>, 

                                <a href="/celebrity/ivan_kapitonov">Ivan Kapitonov</a>, 

                                <a href="/celebrity/kirill_serebrennikov">Kirill Serebrennikov</a>, 

                                <a href="/celebrity/natalya_naumenko">Natalya Naumenko</a>

Or:

    tree.xpath('//div[@class="meta-value"]')[3]

Giving:

<Element div at 0x2915a4c54a8>

The problem is that I can't extract the 'href'. The output I want is:

/celebrity/michael_idov, /celebrity/lily_idova, /celebrity/ivan_kapitonov, /celebrity/kirill_serebrennikov, /celebrity/natalya_naumenko

I have tried:

    page_content.select('div.meta-value')[3].get('href')
    tree.xpath('//div[@class="meta-value"]')[3].get('href')
    tree.xpath('//div[@class="meta-value"]/@href')[3]

All of them return None or raise an error. Could anyone help me out with this?

Thanks in advance! Cheers!

spcvalente
  • Check out this answer: https://stackoverflow.com/questions/5815747/beautifulsoup-getting-href – kaleoh Jul 02 '19 at 19:55
  • Thanks. That returns all the hrefs on the page; however, I just want the ones in the section page_content.select('div.meta-value')[3]. Any tip for this? I tried, with no success, something like: for a in page_content.select('div.meta-value')[2]: print("Found the URL:", a['href']) – spcvalente Jul 02 '19 at 20:05
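
For reference, .get('href') on the div returns None because href is an attribute of the <a> children, not of the div itself, and looping over the div directly also yields the text nodes between the anchors. A minimal sketch that reads the attribute from the anchors instead (reusing the page_content and tree objects from the question):

    writer_div = page_content.select('div.meta-value')[3]
    print(', '.join(a.get('href') for a in writer_div.find_all('a')))

    # lxml equivalent: select the anchors' href attributes directly
    print(', '.join(tree.xpath('//div[@class="meta-value"]')[3].xpath('.//a/@href')))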

1 Answer


Try the following scripts to get the content you are interested in. Make sure to test both of them with different movies; they should both produce the desired output. I avoided any hardcoded indices when targeting the content.

Using CSS selectors:

    import requests
    from bs4 import BeautifulSoup

    r = requests.get('https://www.rottentomatoes.com/m/leto')
    soup = BeautifulSoup(r.text, 'lxml')

    # target each meta row by its label text rather than by a hardcoded index
    directed = soup.select_one(".meta-row:contains('Directed By') > .meta-value > a").text
    written = [item.text for item in soup.select(".meta-row:contains('Written By') > .meta-value > a")]
    written_links = [item.get("href") for item in soup.select(".meta-row:contains('Written By') > .meta-value > a")]
    print(directed, written, written_links)
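
One caveat, not from the original answer: newer soupsieve releases (the selector engine BeautifulSoup uses for .select()) deprecate the bare :contains() pseudo-class in favour of :-soup-contains(), so on current versions the same selector would look like:

    directed = soup.select_one(".meta-row:-soup-contains('Directed By') > .meta-value > a").text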

Using XPath:

    import requests
    from lxml.html import fromstring

    r = requests.get('https://www.rottentomatoes.com/m/leto')
    root = fromstring(r.text)

    # match the element holding the label text, go up to its parent row,
    # then read text/href from the anchors in the row's meta-value div
    directed = root.xpath("//*[contains(.,'Directed By')]/parent::*/*[@class='meta-value']/a/text()")
    written = root.xpath("//*[contains(.,'Written By')]/parent::*/*[@class='meta-value']/a/text()")
    written_links = root.xpath(".//*[contains(.,'Written By')]/parent::*/*[@class='meta-value']/a//@href")
    print(directed, written, written_links)
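
If the contains(., 'Directed By') expressions feel opaque, an equivalent way to target the same rows (an alternative sketch, not part of the original answer) is to match the label div by its class and text and then step to its sibling value div:

    written_links = root.xpath(
        "//div[contains(@class,'meta-label') and contains(.,'Written By')]"
        "/following-sibling::div[@class='meta-value']/a/@href")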

For the cast, I used a list comprehension so that I can call .strip() on each element to remove the surrounding whitespace. normalize-space() would be the ideal option for this, though.

    cast = [item.strip() for item in root.xpath("//*[contains(@class,'cast-item')]//a/span[@title]/text()")]
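
For completeness, a per-element normalize-space() variant of the cast extraction (a sketch; in XPath 1.0 normalize-space() cannot appear as a step inside a location path, so it is applied to each matched span separately):

    spans = root.xpath("//*[contains(@class,'cast-item')]//a/span[@title]")
    cast = [span.xpath("normalize-space(.)") for span in spans]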
SIM
  • The xpath works flawlessly, thanks so much. However, I have no clue how you got it. Any short explanation of the syntax (I don't want to ask too much), or some resource you would advise checking? Thanks again! – spcvalente Jul 02 '19 at 21:51
  • Initially check out the content of [this link](https://www.swtestacademy.com/xpath-selenium/) to get a basic understanding as to how you can create relative xpaths. – SIM Jul 02 '19 at 22:04
  • SIM, if it's not asking too much, would you help me with one more, please? I really couldn't understand the syntax. To get the list of cast (Teo Yoo, Irina Starshenbaum, etc.) I looked hard at the HTML and the guide you sent me, but I couldn't come up with anything useful. Would you please give me a hand here? I really tried. Thanks! – spcvalente Jul 02 '19 at 23:09
  • Edited to include cast @spcvalente . – SIM Jul 03 '19 at 05:40
  • You ARE great indeed! I would gladly buy you a beer. I think with these examples I'll be able to adapt them for other pages. Thank you ever so much. – spcvalente Jul 03 '19 at 09:21