I decided to make this little project to learn how to use mechanize. For now it goes to Urban Dictionary, fills in the word 'skid' in the search form, submits the form, and prints out the resulting HTML.

What I want it to do is find the first definition and print only that. How exactly would I go about doing that?

This is my source code so far:

import mechanize

br = mechanize.Browser()
page = br.open("http://www.urbandictionary.com/")

br.select_form(nr=0)
br["term"] = "skid"
br.submit()

print br.response().read()

Here's where the definition is stored:

<div class="definition">Canadian definition: Commonly used to refer to someone   who      stopped evolving, and bathing, during the 80&#x27;s hair band era.  Generally can be found wearing AC/DC muscle shirts, leather jackets, and sporting a <a href="/define.php?term=mullet">mullet</a>.  The term &quot;skid&quot; is in part derived from &quot;skid row&quot;, which is both a band enjoyed by those the term refers to, as well as their address.  See also <a href="/define.php?term=white%20trash">white trash</a> and <a href="/define.php?term=trailer%20park%20trash">trailer park trash</a></div><div class="example">The skid next door got drunk and beat up his old lady.</div>

You can see it's stored inside the div with the class "definition". I know how to search for that div in the source, but I don't know how to take everything between its opening and closing tags and display it.

Hooked
  • I'm not familiar with mechanize but anyway... my first thought is xpath (lxml) or beautifulsoup – Sheena Aug 23 '13 at 15:23
  • Look into [Scrapy](http://scrapy.org/) and [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) for that kind of task. And if the web site provides an API, that can be the best option. Urban Dictionary, for instance, seems to have a JSON API, but not freely available to anyone. – Paulo Almeida Aug 23 '13 at 15:24
  • Welcome to StackOverflow! Please look over the FAQ, it will help us help you. Typically you don't need a please or thank you, your upvote is a measure of that. Make sure that you accept an answer if it solves your problem. – Hooked Aug 23 '13 at 19:38

3 Answers

Based on your description, I think a regular expression is sufficient for this task. Try this code:

import mechanize, re

br = mechanize.Browser()
page = br.open("http://www.urbandictionary.com/")

br.select_form(nr=0)
br["term"] = "skid"
br.submit()

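source = br.response().read()

# Non-greedy match of everything inside the definition div.
# re.DOTALL lets the match span line breaks in the page source.
pattern = re.compile(r'<div class="definition">(.+?)</div>', re.DOTALL)
matches = pattern.findall(source)
if matches:
    print matches[0]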

This prints the content between the tags (the example part is not included, but it can be extracted the same way). I don't know how you want to handle the tags nested inside that content. If you want to keep them, you're done; if you want to remove them, you can use something like re.sub().
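The tag-removal step mentioned above could look roughly like the following. This is only a sketch under a few assumptions: it is written for Python 3, it runs against a shortened, hard-coded copy of the markup quoted in the question instead of the live page, and the names `SAMPLE` and `first_definition` are made up for illustration.

```python
import html
import re

# Shortened copy of the markup quoted in the question (assumed input).
SAMPLE = ('<div class="definition">Generally can be found sporting a '
          '<a href="/define.php?term=mullet">mullet</a>. The term '
          '&quot;skid&quot; is in part derived from &quot;skid row&quot;.</div>'
          '<div class="example">The skid next door got drunk.</div>')


def first_definition(source):
    """Return the first definition as plain text, or None if there is none."""
    match = re.search(r'<div class="definition">(.+?)</div>', source, re.DOTALL)
    if match is None:
        return None
    # Drop anything that looks like a tag, e.g. <a href="..."> and </a>.
    without_tags = re.sub(r"<[^>]+>", "", match.group(1))
    # Turn entities such as &quot; and &#x27; back into characters.
    text = html.unescape(without_tags)
    # Collapse runs of whitespace left over from the original markup.
    return " ".join(text.split())


print(first_definition(SAMPLE))
```

Like the regex above, this stops at the first closing div tag, so it would cut the text short if a definition ever contained a nested div.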

labyrlnth
  • I understand that you probably just did this as an example, but you really shouldn't use regex to match HTML. See [this classic answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). Someone suggested BeautifulSoup, which was designed for things like this. – Pete Tinkler Aug 23 '13 at 16:02
  • @PeteTinkler Wow, that answer is really cool! Thanks for sharing. To be honest, I wasn't aware of this before, since it has worked every time. I guess I need some time to figure this out. Thanks :-) – labyrlnth Aug 23 '13 at 16:11

Since it was mentioned, I thought that I would provide a BeautifulSoup answer. Use what works best.

import bs4, urllib2

# Use urllib2 to get the html from the web
url     = r"http://www.urbandictionary.com/define.php?term={term}"
request = url.format(term="skid")
raw     = urllib2.urlopen(request).read()

# Convert it into a soup
soup    = bs4.BeautifulSoup(raw)

# Find the requested info
for word_def in soup.findAll(class_ = 'definition'):
    print word_def.string
Hooked
  • This solution has problems if further elements like links are children of the printed element. To get the whole string, use word_def.text instead of 'word_def.string' in the last line. – mwil.me Nov 02 '13 at 16:48
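To illustrate the point about child elements without depending on BeautifulSoup, here is a standard-library sketch (Python 3, `html.parser`) that gathers all the text inside the first definition div, including the text of nested links. It is a stand-in for the idea, not BeautifulSoup's own implementation, and the names `FirstDefinitionParser` and `first_definition_text` are hypothetical.

```python
from html.parser import HTMLParser


class FirstDefinitionParser(HTMLParser):
    """Collects the text of the first <div class="definition">, nested tags included."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True decodes entities such as &quot;
        self.depth = 0      # greater than 0 while inside the target div
        self.done = False   # set once the first definition div has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            if tag == "div":
                self.depth += 1  # track nested divs so the right </div> ends collection
        elif tag == "div" and "definition" in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def first_definition_text(source):
    parser = FirstDefinitionParser()
    parser.feed(source)
    parser.close()
    # Join the collected pieces and collapse runs of whitespace.
    return " ".join("".join(parser.parts).split())


sample = ('<div class="definition">Sporting a '
          '<a href="/define.php?term=mullet">mullet</a>. The term '
          '&quot;skid&quot; comes from &quot;skid row&quot;.</div>'
          '<div class="example">The skid next door.</div>')
print(first_definition_text(sample))
```

The link text "mullet" ends up in the result because text is collected at any nesting depth inside the div, not only from its direct children.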

You can use lxml to parse the HTML fragment:

import lxml.html as html
import mechanize

br = mechanize.Browser()
page = br.open("http://www.urbandictionary.com/")

br.select_form(nr=0)
br["term"] = "skid"
br.submit()

fragment = html.fromstring(br.response().read())

print fragment.find_class('definition')[0].text_content()

Note, however, that this solution removes any tags inside the div and flattens the text.
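If the link targets lost by flattening matter, one option is to collect them separately. The sketch below uses only the Python 3 standard library, not lxml; the `LinkCollector` class and the `BASE_URL` constant are assumptions made for illustration, and the fragment is a shortened copy of the markup from the question.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed base for resolving the site-relative hrefs seen in the question.
BASE_URL = "http://www.urbandictionary.com/"


class LinkCollector(HTMLParser):
    """Records the absolute URL of every <a href="..."> it sees."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve hrefs such as /define.php?term=mullet against the base URL.
                self.links.append(urljoin(self.base_url, href))


fragment = ('<div class="definition">See also '
            '<a href="/define.php?term=white%20trash">white trash</a> and '
            '<a href="/define.php?term=mullet">mullet</a></div>')

collector = LinkCollector(BASE_URL)
collector.feed(fragment)
collector.close()
print(collector.links)
```

Feeding it only the definition fragment, not the whole page, keeps unrelated navigation links out of the result.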

mwil.me