1

I'm making a small Python script to log on to a website automatically, but I'm stuck.

I want to print to the terminal a small part of the HTML, located within this tag in the page's HTML:

<td class=h3 align='right'>&nbsp;&nbsp;John Appleseed</td><td>&nbsp;<a href="members_myaccount.php"><img border=0 src="../tbs_v7_0/images/myaccount.gif" alt="My Account"></a></td>

But how do I extract and print just the name, John Appleseed?

I'm using Python's Mechanize on a Mac, by the way.

Conor Taylor

3 Answers

7

Mechanize is only good for fetching the HTML. Once you want to extract information from it, you could use, for example, BeautifulSoup. (See also my answer to a similar question: Web mining or scraping or crawling? What tool/library should I use?)

Depending on where the <td> is located in the HTML (it's unclear from your question), you could use the following code:

# Updated for Python 3 and BeautifulSoup 4 (pip install beautifulsoup4)
from bs4 import BeautifulSoup

html = ...  # this is the HTML you've fetched (a str), e.g. via mechanize

soup = BeautifulSoup(html, "html.parser")
# use this (gets all <td> elements)
cols = soup.find_all('td')
# or this (gets only <td> elements with class='h3')
cols = soup.find_all('td', class_='h3')
# print the text of the first matching <td>, without surrounding whitespace
print(cols[0].get_text(strip=True))
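
If installing BeautifulSoup is not an option, the same idea (parse the page, then pick out the <td> with class h3) can be sketched with only the standard library's html.parser. This is a swapped-in technique, not the BeautifulSoup API. The names CellTextParser, first_h3_cell and SAMPLE_HTML are made up for illustration, and the sample markup is simply the snippet from the question:

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the text of every <td> whose class attribute is exactly 'h3'."""

    def __init__(self):
        # convert_charrefs defaults to True, so &nbsp; arrives as '\xa0'
        super().__init__()
        self._in_cell = False
        self._buffer = []
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td" and dict(attrs).get("class") == "h3":
            self._in_cell = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            # str.strip() also removes the non-breaking spaces
            self.cells.append("".join(self._buffer).strip())
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._buffer.append(data)


def first_h3_cell(page):
    """Return the text of the first <td class=h3>, or None if there is none."""
    parser = CellTextParser()
    parser.feed(page)
    parser.close()
    return parser.cells[0] if parser.cells else None


# Hypothetical input: the snippet from the question, standing in for the fetched page.
SAMPLE_HTML = (
    "<td class=h3 align='right'>&nbsp;&nbsp;John Appleseed</td>"
    '<td>&nbsp;<a href="members_myaccount.php">'
    '<img border=0 src="../tbs_v7_0/images/myaccount.gif" alt="My Account"></a></td>'
)

name = first_h3_cell(SAMPLE_HTML)
print(name)
```

Because the parser matches on the tag and class rather than on the name itself, it returns whatever name the site puts in that cell for the logged-in user.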
Rabarberski
  • For fetching HTML, why not just use urllib.urlopen()? Personally, I have never used mechanize because I never felt the need for it. – shadyabhi Oct 14 '11 at 07:51
  • @shadyabhi: `urllib` is good as well, but it depends on your needs. I find mechanize useful when you have to deal with a proxy, or session state, or fetch or fill in forms, ... – Rabarberski Oct 14 '11 at 07:54
1

Since you have not provided the full HTML of the page, the only options right now are string.find() or regular expressions.
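
As a rough sketch of the regular-expression route, assuming the name always sits in a <td> whose class is h3 as in the question's snippet (the pattern and the function name extract_name are illustrative, and regexes on HTML are fragile if the markup changes):

```python
import html
import re

# Match the opening <td ... class=h3 ...> tag (quoted or unquoted attribute)
# and lazily capture everything up to the closing </td>.
TD_H3 = re.compile(
    r"<td\b[^>]*\bclass=[\"']?h3\b[\"']?[^>]*>(.*?)</td>",
    re.IGNORECASE | re.DOTALL,
)


def extract_name(page):
    """Return the text inside the first <td class=h3>, or None if absent."""
    match = TD_H3.search(page)
    if match is None:
        return None
    # Turn &nbsp; into real characters, then trim the surrounding whitespace.
    return html.unescape(match.group(1)).strip()


# Hypothetical input: the snippet from the question.
page = (
    "<td class=h3 align='right'>&nbsp;&nbsp;John Appleseed</td>"
    '<td>&nbsp;<a href="members_myaccount.php">'
    '<img border=0 src="../tbs_v7_0/images/myaccount.gif" alt="My Account"></a></td>'
)

print(extract_name(page))
```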

The more standard way of finding an element, though, is XPath. See this question: How to use Xpath in Python?

You can obtain the XPath for an element using the "Inspect Element" feature of Firefox.

For example, to find the XPath for the username on the Stack Overflow site:

  • Open Firefox, log in to the website, right-click the username (shadyabhi in my case), and select Inspect Element.
  • Hover over the tag, or right-click it, and choose "Copy XPath".

shadyabhi
1

You can use a parser to extract any information from a document. I suggest using the lxml module.

Here is an example:

# Python 3 version: StringIO now lives in the io module
from io import StringIO

from lxml import etree

parser = etree.HTMLParser()

tree = etree.parse(StringIO("""<td class=h3 align='right'>&nbsp;&nbsp;John Appleseed</td><td>&nbsp;<a href="members_myaccount.php"><img border=0 src="../tbs_v7_0/images/myaccount.gif" alt="My Account"></a></td>"""), parser)


>>> tree.xpath("string()").strip()
'John Appleseed'

More information about lxml here

Diego Navarro
  • Hmm... What happens when the name changes? I want to be able to log in as anyone with this script, not just John Appleseed – Conor Taylor Oct 14 '11 at 06:28
  • You can put any name you want in that tag: `>>> tree = etree.parse(StringIO("""  Foo Bar My Account"""),parser) >>> tree.xpath("string()").strip() u'Foo Bar'` – Diego Navarro Oct 14 '11 at 06:30
  • No, the name changes depending on which user you are logged in as; I don't own or have access to the site – Conor Taylor Oct 14 '11 at 21:38