Extracting text from link in python

Question

I have a Python 2.7 script that scrapes the table on this page: http://www.the-numbers.com/movie/budgets/all

I want to extract each of the columns. The problem is that my code doesn't pick up the columns that contain links (the 2nd and 3rd columns).

import urllib
from lxml import etree

budgeturl = "http://www.the-numbers.com/movie/budgets/all"
s = urllib.urlopen(budgeturl).read()
htmlpage = etree.HTML(s)
htmltable = htmlpage.xpath("//td[@class='data']/text()")

With this code, htmltable[0] is the rank, htmltable[1] is the Production Budget, and so on. For the columns I am missing, I need the text, not the link.

Can you just grab the text without specifying `class='data'`? It looks like the other TDs have no class. — aghast, Apr 01 '17 at 17:25

score 1 · Answer 1 · edited May 23 '17 at 12:10

import urllib

budgeturl = "http://www.the-numbers.com/movie/budgets/all"
s = urllib.urlopen(budgeturl).read()

def find_between( s, first, last ):
    try:
        start = s.index( first ) + len( first )
        end = s.index( last, start )
        return s[start:end]
    except ValueError:
        return ""

s = find_between(s, '<table>', '</table>')

print s[:500]
print '.............................................................'
print s[-250:]

Find string between two substrings

returns:

>>>
<tr><th>&nbsp;</th><th>Release Date</th><th>Movie</th><th>Production Budget</th><th>Domestic Gross</th><th>Worldwide Gross</th></tr>
<tr><td class="data">1</td>
<td><a href="/box-office-chart/daily/2009/12/18">12/18/2009</a></td>
<td><b><a href="/movie/Avatar#tab=summary">Avatar</a></td>
<td class="data">$425,000,000</td>
<td class="data">$760,507,625</td>
<td class="data">$2,783,918,982</td>
<tr>
<tr><td class="data">2</td>
<td><a href="/box-office-chart/daily/2015/12/18">12/18/2015</a></td>
.............................................................
</td>
<td><a href="/box-office-chart/daily/2005/08/05">8/5/2005</a></td>
<td><b><a href="/movie/My-Date-With-Drew#tab=summary">My Date With Drew</a></td>
<td class="data">$1,100</td>
<td class="data">$181,041</td>
<td class="data">$181,041</td>
<tr>

.........................................

You said "I need the text not the link." Converting the table to CSV via http://www.convertcsv.com/html-table-to-csv.htm gives:

Release Date,Movie,Production Budget,Domestic Gross,Worldwide Gross
1,12/18/2009,Avatar,"$425,000,000","$760,507,625","$2,783,918,982"
8/5/2005,My Date With Drew,"$1,100","$181,041","$181,041"

You can use BeautifulSoup to do the same; see:

beautifulSoup html csv
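
The same conversion can also be done locally instead of through the website. Here is a minimal sketch using Python's standard csv module (swapped in here in place of BeautifulSoup). It assumes the cells have already been extracted into one list of strings per row. `write_rows` is a hypothetical helper name, and the sample row is the Avatar row shown above:

```python
import csv
import sys

# Column names from the table header; the rank column has no name on the page.
HEADER = ["", "Release Date", "Movie", "Production Budget",
          "Domestic Gross", "Worldwide Gross"]

# Sample data: the Avatar row from the output above.
ROWS = [["1", "12/18/2009", "Avatar",
         "$425,000,000", "$760,507,625", "$2,783,918,982"]]

def write_rows(out, rows, header=HEADER):
    """Write the header and rows as CSV to the file-like object `out`."""
    writer = csv.writer(out)  # fields containing commas get quoted automatically
    writer.writerow(header)
    writer.writerows(rows)

write_rows(sys.stdout, ROWS)
```

Any writable file object can be passed in place of sys.stdout.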

score 1 · Accepted Answer · vold · 2017-04-02T00:17:17.357

You need to amend your XPath, since not all td elements have class="data". Try this XPath expression: `//td//text()`

import urllib
from lxml import etree

budgeturl = "http://www.the-numbers.com/movie/budgets/all"
s = urllib.urlopen(budgeturl).read()
htmlpage = etree.HTML(s)
htmltable = htmlpage.xpath("//td//text()")

Output:

edited Apr 02 '17 at 00:17

answered Apr 01 '17 at 23:35
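
Note that `//td//text()` returns one flat list of strings for the whole table. If one list per row is more convenient, a sketch along these lines may help. It assumes the page markup matches the excerpt quoted in this thread (the sample below is copied from it, unclosed tags included). `table_rows` is a hypothetical helper name, not part of lxml:

```python
from lxml import etree

# Sample markup copied from the page excerpt quoted in this thread,
# including its unclosed <b> tag and the stray <tr> that ends the row.
SAMPLE = """
<table>
<tr><th>&nbsp;</th><th>Release Date</th><th>Movie</th><th>Production Budget</th><th>Domestic Gross</th><th>Worldwide Gross</th></tr>
<tr><td class="data">1</td>
<td><a href="/box-office-chart/daily/2009/12/18">12/18/2009</a></td>
<td><b><a href="/movie/Avatar#tab=summary">Avatar</a></td>
<td class="data">$425,000,000</td>
<td class="data">$760,507,625</td>
<td class="data">$2,783,918,982</td>
<tr>
</table>
"""

def table_rows(html):
    """Return one list of cell strings for each table row that has <td> cells."""
    page = etree.HTML(html)
    rows = []
    # //tr[td] skips the header row (only <th>) and any empty <tr>.
    for tr in page.xpath("//tr[td]"):
        # string(.) joins all text below the cell, so the text inside
        # <a> is included while the href attribute is not.
        rows.append([td.xpath("string(.)").strip() for td in tr.xpath("./td")])
    return rows

rows = table_rows(SAMPLE)
print(rows)
```

On the live page, the same function could be given the string read from budgeturl in place of SAMPLE.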
