Scraping in Python with BeautifulSoup

Question

I've read quite a few posts here about this, but I'm very new to Python in general so I was hoping for some more info.

Essentially, I'm trying to write something that will pull word definitions from a site and write them to a file. I've been using BeautifulSoup, and I've made some progress, but here's my issue:

from __future__ import print_function
import requests
import urllib2, urllib
from BeautifulSoup import BeautifulSoup

wordlist = open('test.txt', 'a')

word = raw_input('Paste your word ')

url = 'http://services.aonaware.com/DictService/Default.aspx?action=define&dict=wn&query=%s' % word

# print url

html = urllib.urlopen(url).read()
# print html
soup = BeautifulSoup(html)
visible_text = soup.find('pre')(text=True)

print(visible_text, file=wordlist)

This seems to pull what I need, but it puts it in this format:

[u'passable\n     adj 1: able to be passed or traversed or crossed; &quot;the road is\n            passable&quot;

but I need it to be in plaintext. I've tried using a sanitizer (I was running it through bleach), but that didn't work. I've read some of the other answers here, but they don't explain HOW the code works, and I don't want to add something if I don't understand how it works.

Is there any way to just pull the plaintext?

edit: I ended up doing

from __future__ import print_function
import requests
import urllib2, urllib
from bs4 import BeautifulSoup

wordlist = open('test.txt', 'a')

word = raw_input('Paste your word ')

url = 'http://services.aonaware.com/DictService/Default.aspx?action=define&dict=wn&query=%s' % word

# print url

html = urllib.urlopen(url).read()
# print html
soup = BeautifulSoup(html)
visible_text = soup.find('pre')(text=True)[0]

print(visible_text, file=wordlist)

So what output do you actually want? This is a list of unicode strings. If you want, you could say: "for temp in visible_text: print(temp)" as your last line. — Dr Xorile, Dec 08 '15 at 00:59
Try removing the characters you don't want. Take a look at [http://stackoverflow.com/questions/5843518/remove-all-special-characters-punctuation-and-spaces-from-string](http://stackoverflow.com/questions/5843518/remove-all-special-characters-punctuation-and-spaces-from-string) and [http://stackoverflow.com/questions/1038824/how-do-i-remove-a-substring-from-the-end-of-a-string-in-python](http://stackoverflow.com/questions/1038824/how-do-i-remove-a-substring-from-the-end-of-a-string-in-python) — r3v3r53, Dec 08 '15 at 01:13
@DrXorile The output I'm looking for is what is seen on the page, something I can format afterward. Ultimately, I want to pass in a list of words, have it get all their definitions, and print them to a file. — Josh, Dec 08 '15 at 01:17

score 1 · Answer 1 · answered Dec 08 '15 at 04:49

The code is already giving you plaintext; it just happens to have some characters encoded as entity references. Characters that form part of the XML/HTML syntax, such as the double quote (&quot;), are encoded this way so that they don't break the structure of the markup.

To decode them, use the HTMLParser module (the Python 2 module name, matching the code in the question):

import HTMLParser
h = HTMLParser.HTMLParser()

# unescape() replaces entity references such as &quot; with the characters they stand for
h.unescape('&quot;the road is passable&quot;')
# returns u'"the road is passable"'
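On Python 3 the same decoding is available as html.unescape, which has been in the standard library since Python 3.4. Below is a minimal sketch of that variant, not a drop-in replacement for the code in the question. It assumes the scraped text nodes look like the sample output shown in the question; the helper name to_plain_text and the sample list are made up for illustration.

```python
import html


def to_plain_text(text_nodes):
    """Join scraped text nodes and decode HTML entity references.

    `text_nodes` is a hypothetical stand-in for the list of strings
    that soup.find('pre')(text=True) returns in the question.
    """
    # Concatenate the pieces so the result is one string, not a list.
    joined = "".join(text_nodes)
    # html.unescape turns &quot; into ", &amp; into &, and so on.
    return html.unescape(joined)


# Illustrative sample shaped like the output in the question (assumed data).
sample = [
    'passable\n     adj 1: able to be passed or traversed or crossed; ',
    '&quot;the road is\n            passable&quot;',
]

plain = to_plain_text(sample)
print(plain)
```

The resulting string can then be written to the file the same way as in the question, for example with print(plain, file=wordlist).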
